diff --git a/integrations/pytorch_ddp/DEVELOPMENT.md b/integrations/pytorch_ddp/DEVELOPMENT.md
new file mode 100644
index 00000000..658f057e
--- /dev/null
+++ b/integrations/pytorch_ddp/DEVELOPMENT.md
@@ -0,0 +1,53 @@
+This document explains the current state of development and records some of the decisions that were made.
+
+## Structure
+
+The integration consists of:
+
+- wrapper, bindings and helper functionality in ./accl_process_group
+- the main C++ sources in ./src
+- the ACCL repo the process group builds on top of, in ./accl. It is kept as a local copy so that you can try different versions
+- test scripts in ./test
+
+## Build process
+
+Check the ./install.py helper for dependency versions.
+
+./setup.py configures the build.
+
+See the section in the README on how to avoid the long pip build.
+
+## Basics
+
+- Currently only runs via Coyote RDMA; XRT and GPU support were dropped. The simulator still runs over XRT UDP, though.
+- Needs an MPI library to work (configured in setup.py). Tested only with MPICH.
+- The test setup in run.sh targets the HACC cluster.
+- Use ACCL_DEBUG=1 both during the build and at runtime.
+- Everything runs in rendezvous mode.
+- Collectives called directly run synchronously, but e.g. the allreduce used internally by DDP is executed asynchronously (see the sketch after this list).
+- The PG allocates two buffers and reuses them to avoid reallocation. This is supposed to be replaced with a host buffer constructor that takes an existing memory region. To change the buffer type you need the change_buffer_type branch (maybe already merged) at https://github.com/lawirz/ACCL
+- The torch profiler can see the overall execution time, but attempts to set it up to measure sub-operations inside the worker thread failed.
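+
+To illustrate the synchronous/asynchronous distinction above, here is a minimal sketch at the torch.distributed level. It assumes the ACCL backend has already been registered and initialized (via accl_process_group and an mpirun launch as in test/run.sh); it is not a verbatim copy of the test scripts.
+
+```python
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# Assumes dist.init_process_group(backend="ACCL", ...) has already been done
+# by the ACCL process-group setup used in the test scripts.
+
+# A collective called directly blocks until ACCL has finished (synchronous).
+t = torch.ones(1024)
+dist.all_reduce(t)  # returns once the rendezvous allreduce has completed
+
+# Inside DDP, the gradient allreduces are launched asynchronously per bucket
+# and only joined at the end of backward().
+model = DDP(nn.Linear(64, 64))
+loss = model(torch.randn(8, 64)).sum()
+loss.backward()  # asynchronous bucketed allreduce of gradients
+```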
+
+## ProcessGroupACCL.cpp
+
+### ProcessGroup structure
+
+A lot of the design comes from ProcessGroupMPI. There is a concept of WorkEntries, which schedule work on a separate worker thread; as in the MPI PG, this is currently done with a single worker thread. There is still a lock, probably only relevant for a few management operations coming from the DDP side. With async execution in ACCL, we could try a different structure based on AsyncWork, as is done in the Gloo PG, I think. A Python sketch of the single-worker-queue pattern follows.
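+
+The sketch below is only a Python analogy for that structure (one worker thread consuming queued entries and completing futures); it is not actual ProcessGroup code.
+
+```python
+import queue
+import threading
+from concurrent.futures import Future
+
+class WorkerLoop:
+    """Single worker thread that runs queued entries in order, like runLoop."""
+
+    def __init__(self):
+        self._queue = queue.Queue()
+        self._thread = threading.Thread(target=self._run, daemon=True)
+        self._thread.start()
+
+    def enqueue(self, fn, *args) -> Future:
+        fut = Future()
+        self._queue.put((fn, args, fut))  # analogous to enqueue() + a WorkEntry
+        return fut
+
+    def _run(self):
+        while True:
+            fn, args, fut = self._queue.get()
+            try:
+                fut.set_result(fn(*args))  # run the collective on the worker
+            except Exception as exc:
+                fut.set_exception(exc)
+```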
+
+### Collectives
+
+- There are small wrappers which do a few checks (mostly copied from the MPI PG), handle the optional MPI sidestep, and then set up the WorkEntry.
+- The WorkEntries manage the segmentation, which is not yet correctly implemented everywhere. Some collectives still use a version that relies on the input being one-dimensional. Others that require multiple segmentations, such as scatter, have similar limitations.
+- The input is copied into the pre-allocated buffer. Copies using memcpy generally seem to be much faster than tensor.copy_, for some reason.
+- ACCL performs a host-to-host call. The driver figures out that it is host-to-host from the buffer type. The compressed type should be added as an argument to make compression work again.
+- The result is copied back into the original tensor (see the flow sketch after this list).
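+
+A Python-level sketch of that per-collective flow (copy into a reused staging buffer, host-to-host call, copy back). The buffer names only mirror in_buf/out_buf in ProcessGroupACCL.cpp; the ACCL call itself is replaced by a placeholder, and contiguous float32 input is assumed.
+
+```python
+import numpy as np
+import torch
+
+# Pre-allocated host staging buffers, reused across calls (like in_buf/out_buf).
+BUF_ELEMS = 1 << 20
+in_buf = np.empty(BUF_ELEMS, dtype=np.float32)
+out_buf = np.empty(BUF_ELEMS, dtype=np.float32)
+
+def run_collective(tensor: torch.Tensor) -> None:
+    # The PG checks that inputs are contiguous, so the flat view aliases the storage.
+    flat = tensor.view(-1)
+    n = flat.numel()
+    # 1. memcpy-style copy of the input into the staging buffer
+    in_buf[:n] = flat.numpy()
+    # 2. the host-to-host ACCL call would go here, reading in_buf and writing out_buf
+    out_buf[:n] = in_buf[:n]
+    # 3. copy the result back into the original tensor
+    flat.numpy()[:] = out_buf[:n]
+```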
+
+## Hardware issues
+
+A lot of collectives still fail in hardware. The following can produce issues:
+
+- mixing datatypes, especially ints
+- high variability in message length
+- MPI sidestepping (it is unclear why this causes issues)
+
+If you run test-resnet50, you will encounter them.
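+
+As a rough illustration, traffic of the following shape (mixed dtypes, varying lengths) is the kind of pattern that has triggered such failures. This is a hypothetical reproduction sketch assuming an already-initialized ACCL backend, not a committed test.
+
+```python
+import torch
+import torch.distributed as dist
+
+# assumes dist.init_process_group(backend="ACCL", ...) has already run
+for i in range(100):
+    n = (i % 7 + 1) * 1000                            # high variability in length
+    dtype = torch.int32 if i % 2 else torch.float32   # mixing datatypes
+    dist.all_reduce(torch.ones(n, dtype=dtype))
+```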
diff --git a/integrations/pytorch_ddp/README.md b/integrations/pytorch_ddp/README.md
index cc998d45..94a0d07d 100644
--- a/integrations/pytorch_ddp/README.md
+++ b/integrations/pytorch_ddp/README.md
@@ -12,6 +12,8 @@ python3 -m venv venv
source venv/bin/activate
```
+Activate an XRT 21 version. Later versions have led to issues in the past.
+
Installation without GPU support
To install the plugin without GPU support, simply run the following from within the venv:
@@ -42,8 +44,42 @@ source venv/bin/activate
## Running the plugin
+
Make sure to source the `setup.sh` script in this directory to load the ACCL plugin before starting a Python script.
Example usage can be found in the various test files under [`test/`](test).
Do make sure not to run python from within the root directory of `pytorch_ddp`, because Python will try to import the
local incomplete [`accl_process_group/`](accl_process_group) folder instead of the actual installation.
+
+The provided `test/run.sh` will launch a test script via `mpirun`.
+
+## Setup overview
+
+- The whole ProcessGroup is wrapped in OpenMPI, which is used for initialization.
+- You can fall back to the OpenMPI implementation of certain collectives via the "sidestep" flags in ProcessGroupACCL.cpp.
+- Recompilation using `./install.py` or `pip install .` can be very slow; you can run `python setup.py build_ext --inplace` and then copy the binary or other files directly, e.g. `cp accl_process_group/_c/ProcessGroupACCL.cpython-38-x86_64-linux-gnu.so ~/.local/lib/python3.8/site-packages/accl_process_group/_c/`
+- The `install.py` script will not reinstall the driver after ACCL updates; you will need to rebuild it yourself.
+- Set `ACCL_DEBUG=1` if you want more output (also set it during the build). Stdout is sometimes incomplete (in the simulator), so it is best to log most things to stderr.
+- The run script currently just prints the command to be run (better without the trailing `&`), which you then run manually. This is because of bad experiences with missing output (possibly coinciding with the issues mentioned above) and with termination on multiple machines, but it should also work if you comment out the `exit 0` and the trailing `&` of mpirun. Don't forget that you should still run the script to clear the log files.
+- ACCL only supports message sizes up to 4MB. If you pass larger tensors, the PG will try to segment them along the first dimension (see the sketch after this list). Not all collectives correctly handle multi-dimensional tensors yet.
+- Setting up the simulator with 4MB buffers takes long; better set the size lower for quick tests.
+- You can initialize the process group as if it were UDP and run it on a `cyt_rdma` simulator.
+- There is no reason not to support RDMA + simulator initialization; it just hasn't been implemented yet. Certain case splits assume no-sim when cyt_rdma is given...
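+
+A sketch of the first-dimension segmentation described above, mirroring the `tensor.narrow(0, ...)` loop in ProcessGroupACCL.cpp; the 4MB constant is the limit mentioned in this list, and the helper name is illustrative only:
+
+```python
+import torch
+
+ACCL_MAX_BYTES = 4 * 1024 * 1024  # per-message limit mentioned above
+
+def segment_first_dim(tensor: torch.Tensor):
+    """Yield views of `tensor` along dim 0 that each fit within ACCL_MAX_BYTES."""
+    total_bytes = tensor.numel() * tensor.element_size()
+    if total_bytes <= ACCL_MAX_BYTES:
+        yield tensor
+        return
+    row_bytes = (tensor.numel() // tensor.size(0)) * tensor.element_size()
+    rows_per_chunk = max(1, ACCL_MAX_BYTES // row_bytes)
+    for start in range(0, tensor.size(0), rows_per_chunk):
+        length = min(rows_per_chunk, tensor.size(0) - start)
+        yield tensor.narrow(0, start, length)
+
+# e.g. a (4096, 1024) float32 tensor (16MB) is split into four (1024, 1024) views
+chunks = list(segment_first_dim(torch.zeros(4096, 1024)))
+```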
+
+### How to install torchvision
+
+- install torch using the install script
+- clone vision and check out the matching version v0.16.0
+- clone libpng and configure it with the prefix set to a local directory
+- add its bin directory to the PATH
+- possibly not needed: supply the library and include paths to torchvision as described in their development doc
+- disable the version check in torchvision's setup.py, because it doesn't parse the version correctly
+- run vision's setup.py with the debug, include, library and use-png flags (a quick sanity check is sketched below)
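+
+A quick, illustrative sanity check that the source-built torchvision imports and that its image extension was built against libpng (run inside the venv from above):
+
+```python
+import torch
+import torchvision
+from torchvision.io import read_image  # the image extension needs libpng at build time
+
+print(torch.__version__, torchvision.__version__)
+# Decoding an actual file, e.g. read_image("some_image.png"), will fail at runtime
+# if torchvision was built without PNG support.
+```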
+
+### Tests available
+
+Check `test/run.sh` for `ACCL_SCRIPT` examples.
+
+- `test-generic.py` tests everything in isolation plus a small two-layer model learning a linear function
+- `test-mnist.py` can also be run non-distributed (check the arguments)
+- `test-imagenet.py` does finetuning of ResNet50 according to: and can also be run non-distributed
+- For DLRM you will need a small fork of the DLRM repo with ACCL support, hosted at . It contains its own `run.sh`
diff --git a/integrations/pytorch_ddp/accl_process_group/__init__.py b/integrations/pytorch_ddp/accl_process_group/__init__.py
index cac971eb..55098ff2 100644
--- a/integrations/pytorch_ddp/accl_process_group/__init__.py
+++ b/integrations/pytorch_ddp/accl_process_group/__init__.py
@@ -17,5 +17,5 @@
from ._c.ProcessGroupACCL import ProcessGroupACCL, Rank, DataType, ACCLDesign
from .process_group_wrapper import create_process_group, \
- create_process_group_coyote, create_simulate_process_group, initialize, \
+ initialize, \
set_compression, get_compression, get_local_qp, set_remote_qp
diff --git a/integrations/pytorch_ddp/accl_process_group/process_group_wrapper.py b/integrations/pytorch_ddp/accl_process_group/process_group_wrapper.py
index 06979add..2c346f73 100644
--- a/integrations/pytorch_ddp/accl_process_group/process_group_wrapper.py
+++ b/integrations/pytorch_ddp/accl_process_group/process_group_wrapper.py
@@ -19,23 +19,29 @@
from typing import Optional
from . import ProcessGroupACCL, Rank, DataType, ACCLDesign
import torch
+import logging
from torch.distributed import Backend
from torch.distributed.distributed_c10d import ProcessGroup, Store
-
+import sys
+import os
process_group: Optional[ProcessGroupACCL] = None
+#Configure logging
+logger = logging.getLogger(__name__)
+if "ACCL_DEBUG" in os.environ and os.environ["ACCL_DEBUG"]=="1":
+ logger.setLevel(logging.DEBUG)
+else:
+ logger.setLevel(logging.WARNING)
def create_process_group(
- ranks: list[Rank],
- xclbin: str, device_index: int, design: ACCLDesign,
+ ranks: list[Rank], design: ACCLDesign,
*, nbufs: int = 16, bufsize: int = 1024,
compression: Optional[dict[DataType, DataType]] = None,
p2p_enabled: bool = False, profiling_ranks: Optional[list[int]] = None,
profiling_timeout: float = 0.0, rsfec: bool = False,
+ simulation: bool = False,
initialize: bool = True) -> ProcessGroup:
- if design == ACCLDesign.cyt_rdma or design == ACCLDesign.cyt_tcp:
- raise RuntimeError(f"{design} is an incompatible design for XRT")
if compression is None:
compression = {}
@@ -43,44 +49,14 @@ def create_process_group(
# Copy compression since it will be used later in the lambda function
compression = compression.copy()
+ logger.debug(f'Compression: {compression}')
+
if profiling_ranks is None:
profiling_ranks = []
else:
profiling_ranks = profiling_ranks.copy()
- def create_process_group_wrapper(store, rank, size, _timeout):
- global process_group
- if process_group is not None:
- raise RuntimeError("ACCL ProcessGroup already created, "
- "can only create one.")
-
- pg = ProcessGroupACCL(store, rank, size, ranks, False, design,
- xclbin=xclbin, device_index=device_index,
- bufsize=bufsize, rsfec=rsfec, nbufs=nbufs,
- compression=compression,
- p2p_enabled=p2p_enabled,
- profiling_ranks=profiling_ranks,
- profiling_timeout=profiling_timeout)
-
- process_group = pg
- if initialize:
- pg.initialize()
-
- return pg
-
- Backend.register_backend("ACCL", create_process_group_wrapper)
-
-def create_simulate_process_group(ranks: list[Rank], *,
- nbufs: int = 16, udp: bool = False,
- compression: Optional[dict[DataType,
- DataType]] = None,
- bufsize: int = 1024,
- initialize: bool = True) -> ProcessGroup:
- if compression is None:
- compression = {}
- else:
- # Copy compression since it will be used later in the lambda function
- compression = compression.copy()
+ logger.debug(f'Profiling_ranks: {profiling_ranks}')
def create_process_group_wrapper(store, rank, size, _timeout):
global process_group
@@ -88,48 +64,13 @@ def create_process_group_wrapper(store, rank, size, _timeout):
raise RuntimeError("ACCL ProcessGroup already created, "
"can only create one.")
- design = ACCLDesign.udp if udp else ACCLDesign.tcp
+ # if simulation:
+ #overwrite the design choice in simulation
+ # design = ACCLDesign.udp
- pg = ProcessGroupACCL(store, rank, size, ranks, True, design,
- compression=compression, nbufs=nbufs,
- bufsize=bufsize)
-
- process_group = pg
- if initialize:
- pg.initialize()
-
- return pg
-
- Backend.register_backend("ACCL", create_process_group_wrapper)
-
-def create_process_group_coyote(
- ranks: list[Rank], design: ACCLDesign,
- *, nbufs: int = 16, bufsize: int = 1024,
- compression: Optional[dict[DataType, DataType]] = None,
- p2p_enabled: bool = False, profiling_ranks: Optional[list[int]] = None,
- profiling_timeout: float = 0.0, rsfec: bool = False,
- initialize: bool = False) -> ProcessGroup:
- if design != ACCLDesign.cyt_rdma and design != ACCLDesign.cyt_tcp:
- raise RuntimeError(f"{design} is an incompatible design for coyote")
-
- if compression is None:
- compression = {}
- else:
- # Copy compression since it will be used later in the lambda function
- compression = compression.copy()
-
- if profiling_ranks is None:
- profiling_ranks = []
- else:
- profiling_ranks = profiling_ranks.copy()
-
- def create_process_group_wrapper(store, rank, size, _timeout):
- global process_group
- if process_group is not None:
- raise RuntimeError("ACCL ProcessGroup already created, "
- "can only create one.")
-
- pg = ProcessGroupACCL(store, rank, size, ranks, False, design,
+ logger.debug(f'Creating ProcessGroupACCL for: rank {rank}')
+
+ pg = ProcessGroupACCL(store, rank, size, ranks, simulation, design,
bufsize=bufsize, rsfec=rsfec, nbufs=nbufs,
compression=compression,
p2p_enabled=p2p_enabled,
@@ -138,31 +79,37 @@ def create_process_group_wrapper(store, rank, size, _timeout):
process_group = pg
if initialize:
+ logger.debug('Initializing Process Group')
pg.initialize()
-
return pg
- Backend.register_backend("ACCL", create_process_group_wrapper)
+ #CPU only for now
+ logger.debug('Registering ACCL Backend')
+ Backend.register_backend("ACCL", create_process_group_wrapper, devices='cpu')
def initialize() -> None:
+ logger.debug('Initialize called')
if process_group is None:
raise RuntimeError("Cannot initialize before ACCL ProcessGroup "
"is created.")
process_group.initialize()
def get_local_qp(rank: int) -> list[int]:
+ logger.debug('Get_local_qp called')
if process_group is None:
raise RuntimeError("Cannot get local qp before ACCL ProcessGroup "
"is created.")
return process_group.get_local_qp(rank)
def set_remote_qp(rank: int, qp: list[int]) -> None:
+ logger.debug('Set_remote_qp called')
if process_group is None:
raise RuntimeError("Cannot set remote qp before ACCL ProcessGroup "
"is created.")
return process_group.set_remote_qp(rank, qp)
def set_compression(compression: dict[DataType, DataType]):
+ logger.debug(f'Setting compression to {compression}')
if process_group is None:
raise RuntimeError("Cannot set compression before ACCL ProcessGroup "
"is initialized.")
diff --git a/integrations/pytorch_ddp/include/ProcessGroupACCL.hpp b/integrations/pytorch_ddp/include/ProcessGroupACCL.hpp
index 4218ad36..04d6f2c7 100644
--- a/integrations/pytorch_ddp/include/ProcessGroupACCL.hpp
+++ b/integrations/pytorch_ddp/include/ProcessGroupACCL.hpp
@@ -266,17 +266,21 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {
void run_send(at::Tensor tensor, int dstRank, int tag);
void run_recv(at::Tensor tensor, int rcvRank, int tag);
- void run_broadcast(at::Tensor tensor, const BroadcastOptions &opts);
- void run_allreduce(at::Tensor tensor, const AllreduceOptions &opts);
- void run_reduce(at::Tensor tensor, const ReduceOptions &opts);
- void run_allgather(at::Tensor srctensor,
+ void run_broadcast(at::Tensor in_tensor, const BroadcastOptions &opts);
+ void run_allreduce(at::Tensor in_tensor, const AllreduceOptions &opts);
+ void run_reduce(at::Tensor in_tensor, const ReduceOptions &opts);
+ void run_allgather(at::Tensor in_tensor,
const std::vector &dsttensors);
- void run_gather(at::Tensor srctensor,
+ void run_gather(at::Tensor in_tensor,
const std::vector &dsttensors,
const GatherOptions &opts);
- void run_scatter(std::vector &srctensors, at::Tensor dsttensor,
+ void run_scatter(std::vector &in_tensors, at::Tensor dsttensor,
const ScatterOptions &opts);
- void run_alltoall(at::Tensor srctensor, at::Tensor dsttensor, const AllToAllOptions &opts);
+
+ void run_alltoall(at::Tensor in_tensor, at::Tensor dsttensor, const AllToAllOptions &opts);
+
+ void run_alltoall_vec(std::vector &in_tensor_vec,
+ std::vector &out_tensor_vec, const AllToAllOptions &opts);
ACCL::dataType get_compressed_type(c10::ScalarType datatype);
@@ -292,6 +296,17 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {
// Global states
static void initACCLOnce();
static void acclExit();
+
+ void init_input_tensor(at::Tensor &tensor, std::unique_ptr &data, bool do_on_root, bool do_on_others, int opts_root_rank = 0);
+
+ void init_input_tensor_new(at::Tensor &tensor, ACCL::BaseBuffer *data, bool do_on_root, bool do_on_others, int opts_root_rank = 0);
+
+ void init_input_data_vec(std::vector &tensor_vec, std::unique_ptr &data, const at::TensorOptions &options, bool do_on_root, bool do_on_others, int opts_root_rank = 0);
+
+ void copy_back_tensor(at::Tensor tensor_original, std::unique_ptr &data, bool do_on_root, bool do_on_others, int opts_root_rank = 0);
+
+ void copy_back_tensorvec(const std::vector &dsttensorvec, std::unique_ptr &data, at::Tensor &dsttensor, int numel, int offset, bool do_on_root, bool do_on_others, int opts_root_rank = 0);
+
static std::once_flag onceFlagInitACCL;
static std::mutex pgGlobalMutex_;
@@ -309,6 +324,7 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {
ACCL::CoyoteDevice *cyt_device;
std::vector ibvQpConn_vec;
+ xrt::device xrt_device;
std::unique_ptr accl;
uint64_t bufsize;
@@ -318,6 +334,9 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {
bool initialized;
xrt::bo buf0;
xrt::bo buf1;
+
+ std::unique_ptr in_buf;
+ std::unique_ptr out_buf;
};
} // namespace c10d
diff --git a/integrations/pytorch_ddp/include/coyote_init.hpp b/integrations/pytorch_ddp/include/coyote_init.hpp
deleted file mode 100644
index 61989a15..00000000
--- a/integrations/pytorch_ddp/include/coyote_init.hpp
+++ /dev/null
@@ -1,12 +0,0 @@
-#pragma once
-#include
-
-#include
-
-namespace coyote_init {
-void setup_cyt_rdma(std::vector &ibvQpConn_vec,
- std::vector &ranks, int local_rank,
- ACCL::CoyoteDevice &device);
-void configure_cyt_rdma(std::vector &ibvQpConn_vec,
- std::vector &ranks, int local_rank);
-} // namespace coyote_init
diff --git a/integrations/pytorch_ddp/install.py b/integrations/pytorch_ddp/install.py
index b97ab21f..e1366a7d 100755
--- a/integrations/pytorch_ddp/install.py
+++ b/integrations/pytorch_ddp/install.py
@@ -112,7 +112,7 @@ def install_pytorch(rocm: bool = False, cuda: bool = False):
def install_accl_driver(accl_driver_path: Path):
print("Installing accl driver...")
- if 'ACCL_DEBUG' in os.environ:
+ if 'ACCL_DEBUG' in os.environ and os.environ["ACCL_DEBUG"]=="1":
extra_args = ['-DACCL_DEBUG=1']
else:
extra_args = []
@@ -121,7 +121,7 @@ def install_accl_driver(accl_driver_path: Path):
subprocess.run(['make'], cwd=accl_driver_path, check=True)
-def install_accl_process_group(rocm: bool = False, cuda: bool = False):
+def install_accl_process_group(rocm: bool = False, cuda: bool = False, debug: bool = False):
if not accl_driver_path.exists():
clone_accl()
if not accl_driver.exists():
@@ -131,12 +131,14 @@ def install_accl_process_group(rocm: bool = False, cuda: bool = False):
env = os.environ.copy()
env['USE_ROCM'] = '1' if rocm else '0'
env['USE_CUDA'] = '1' if cuda else '0'
+ if debug:
+ env['ACCL_DEBUG'] = '1'
subprocess.run([python, '-m', 'pip', '-v', 'install', '.'],
env=env, cwd=root, check=True)
def main(rocm: bool = False, cuda: bool = False,
- force_accl_process_group: bool = False, force_pytorch: bool = False):
+ force_accl_process_group: bool = False, force_pytorch: bool = False, debug: bool = False):
packages = test_packages()
if force_pytorch and torch_dir.exists():
@@ -150,10 +152,14 @@ def main(rocm: bool = False, cuda: bool = False,
"please rerun with the --force-pytorch flag enabled.")
exit(1)
- if not packages['accl-process-group'] or force_accl_process_group:
+ if not packages['accl-process-group']:
print("ACCL Process Group not found, installing...")
- install_accl_process_group(rocm, cuda)
+ install_accl_process_group(rocm, cuda, debug)
+ elif force_accl_process_group:
+ print("Forced reinstall of ACCL Process Group ")
+ install_accl_process_group(rocm, cuda, debug)
+
if __name__ == '__main__':
import argparse
@@ -165,21 +171,23 @@ def main(rocm: bool = False, cuda: bool = False,
'ProcessGroup in the current virtual environment.\nWill also install '
'PyTorch if it isn\'t installed already.')
gpu_support = parser.add_mutually_exclusive_group()
- gpu_support.add_argument('--rocm', action='store_true',
+ gpu_support.add_argument('-r','--rocm', action='store_true',
help='Installs the Process Group with ROCm '
'support.')
- gpu_support.add_argument('--cuda', action='store_true',
+ gpu_support.add_argument('-c','--cuda', action='store_true',
help='Installs the Process Group with CUDA '
'support.')
- parser.add_argument('--force-accl-process-group', action='store_true',
+ parser.add_argument('-a','--force-accl-process-group', action='store_true',
help='Force a reinstall of the ACCL Process Group')
- parser.add_argument('--force-pytorch', action='store_true',
+ parser.add_argument('-t','--force-pytorch', action='store_true',
help='Force a reinstall of PyTorch '
f'{CURRENT_PYTORCH_VERSION} with the correct CXX11 ABI'
' settings applied.')
parser.add_argument('-f', '--force', action='store_true',
help='Enables both --force-accl-process-group and '
'--force-pytorch.')
+ parser.add_argument('-d', '--debug', action='store_true',
+ help='Will print ACCL debugging info using ACCL_DEBUG=1')
args = parser.parse_args()
if args.force:
@@ -188,7 +196,7 @@ def main(rocm: bool = False, cuda: bool = False,
try:
main(args.rocm, args.cuda, args.force_accl_process_group,
- args.force_pytorch)
+ args.force_pytorch, args.debug)
except KeyboardInterrupt:
print("Cancelled installation")
exit(1)
diff --git a/integrations/pytorch_ddp/setup.py b/integrations/pytorch_ddp/setup.py
index 4f16452e..d7998451 100755
--- a/integrations/pytorch_ddp/setup.py
+++ b/integrations/pytorch_ddp/setup.py
@@ -41,27 +41,25 @@
accl_utils_dir = driver_dir / 'utils' / 'accl_network_utils'
vnx_dir = root / 'accl' / 'test' / 'refdesigns' / 'xup_vitis_network_example' \
/ 'xrt_host_api'
-roce_dir = root / 'accl' / 'test' / 'refdesigns' / 'HiveNet' \
- / 'network' / 'roce_v2' / 'xrt_utils'
include_dirs = [root / 'include', driver_dir / 'xrt' / 'include',
accl_utils_dir / 'include', xrt_dir / 'include',
root / 'accl' / 'test' / 'model' / 'zmq',
- vnx_dir / 'include', roce_dir / 'include',
+ vnx_dir / 'include',
root / 'accl' / 'test' / 'refdesigns' / 'Coyote' / 'sw' / 'include',
+ '/pub/scratch/zhe/mpich/install/include',
'/usr/include/jsoncpp']
-library_dirs = [driver_dir / 'xrt' / 'lib', xrt_dir / 'lib']
+library_dirs = [driver_dir / 'xrt' / 'lib', xrt_dir / 'lib', '/mnt/scratch/zhe/mpich/install/lib/libmpicxx.so']
libraries = ['accl', 'jsoncpp', 'zmq']
sources = [root / 'src' / 'ProcessGroupACCL.cpp',
- root / 'src' / 'coyote_init.cpp',
vnx_dir / 'src' / 'cmac.cpp', vnx_dir / 'src' / 'networklayer.cpp',
- roce_dir / 'src' / 'cmac.cpp', roce_dir / 'src' / 'hivenet.cpp',
accl_utils_dir / 'src' / 'accl_network_utils.cpp']
compile_args = ['-Wno-reorder',
'-Wno-sign-compare',
'-Wno-unused-but-set-variable',
'-DACCL_HARDWARE_SUPPORT',
+ '-DACCL_NETWORK_UTILS_MPI',
'-std=c++17',
'-g']
diff --git a/integrations/pytorch_ddp/src/ProcessGroupACCL.cpp b/integrations/pytorch_ddp/src/ProcessGroupACCL.cpp
index e0c3e596..8ee9f012 100644
--- a/integrations/pytorch_ddp/src/ProcessGroupACCL.cpp
+++ b/integrations/pytorch_ddp/src/ProcessGroupACCL.cpp
@@ -23,6 +23,8 @@
#include
#include
#include
+#include
+#include
#ifdef ACCL_PROCESS_GROUP_HIP_ENABLED
#include "hip/hip_runtime.h"
@@ -37,24 +39,87 @@
#include
#include
-#include "coyote_init.hpp"
-namespace cyt = coyote_init;
namespace py = pybind11;
+using namespace ACCL;
namespace c10d {
-#define CEIL_DIV(x, y) ((x) / (y) + ((x) % (y) != 0))
+// Toggles to run collectives via OpenMPI instead (to sidestep any issues with them in ACCL)
+// The sidestep-code is copied from the ProcessGroupMPI
+// #define SCATTER_SIDESTEP
+// #define GATHER_SIDESTEP
+// #define ALLGATHER_SIDESTEP
+
+#define BROADCAST_SIDESTEP false
+// #define BROADCAST_SIDESTEP true
+
+
+#define ALLREDUCE_SIDESTEP false
+// #define ALLREDUCE_SIDESTEP true
+
+// #define SIDESTEP_BCAST_WITH_ALLREDUCE
+
+#define RDVZ_THRESHOLD 64
+
+// This is the maximal message size; larger sizes get segmented
+#define ACCL_MSG_SIZE 2097152
+
+// counts are rounded up to this number for stability reasons
+#define ROUND_NR 256
-#define ACCL_ERROR(status) \
- ("ACCL error in: " + std::string(__FILE__) + ":" + \
- std::to_string(__LINE__) + ", with error code: " + std::to_string(status))
+// This is intended for debugging; code can refer to the name of the current collective via this macro
+#define COLL_NAME UNNAMED
+
+#define x_MAKE_STRING(s) MAKE_STRING(s)
+#define MAKE_STRING(s) #s
+
+// Used in sidestepping
+#define MPI_CHECK(cmd) \
+ do { \
+ int mpiStatus = cmd; \
+ if (mpiStatus != MPI_SUCCESS) { \
+ std::string err = "MPI error in: " + std::string(__FILE__) + ":" + \
+ std::to_string(__LINE__) + \
+ ", with error code: " + std::to_string(mpiStatus); \
+ TORCH_CHECK(false, err); \
+ } \
+ } while (0)
+
+// Used in sidestepping
+// Op mapping
+std::map mpiOp = {
+ {ReduceOp::MIN, MPI_MIN},
+ {ReduceOp::MAX, MPI_MAX},
+ {ReduceOp::SUM, MPI_SUM},
+ {ReduceOp::PRODUCT, MPI_PROD},
+};
+// Used in sidestepping
+// Type mapping
+std::map mpiDatatype = {
+ {at::kByte, MPI_UNSIGNED_CHAR},
+ {at::kChar, MPI_CHAR},
+ {at::kDouble, MPI_DOUBLE},
+ {at::kFloat, MPI_FLOAT},
+ {at::kInt, MPI_INT},
+ {at::kLong, MPI_LONG},
+ {at::kShort, MPI_SHORT},
+};
+
+#define CEIL_DIV(x, y) ((x) / (y) + ((x) % (y) != 0))
#if defined(ACCL_PROCESS_GROUP_HIP_ENABLED) && \
defined(ACCL_PROCESS_GROUP_CUDA_ENABLED)
#error Cannot compile Process Group with both HIP and CUDA support
#endif // ACCL_PROCESS_GROUP_HIP_ENABLED && ACCL_PROCESS_GROUP_CUDA_ENABLED
+#define DO_COND ((do_on_root && opts_root_rank == rank_) || (do_on_others && opts_root_rank != rank_))
+
+#define PRE_REQUEST(opname, tensor) \
+ in_buf->change_type(convert_datatype_from_torch(tensor.scalar_type())); \
+ out_buf->change_type(convert_datatype_from_torch(tensor.scalar_type())); \
+ ACCL::debug("Performing " #opname " of " + std::to_string(tensor.numel()) + " items")
+
namespace {
/* Alternative for std::format from C++20 in C++17.
@@ -73,34 +138,6 @@ std::string string_format(const std::string &format, Args... args) {
buf.get() + size - 1); // We don't want the '\0' inside
}
-template
-std::string format_array(val_t *data, std::size_t size, std::size_t breakval = 3) {
- std::ostringstream buffer;
- buffer << "[";
- if (size <= breakval * 2 + 1) {
- for (std::size_t i = 0; i < size; ++i) {
- buffer << data[i];
- if (i + 1 != size) {
- buffer << ", ";
- }
- }
- } else {
- for (std::size_t i = 0; i < breakval; ++i) {
- buffer << data[i] << ", ";
- }
- buffer << "..., ";
- for (std::size_t i = size - breakval; i < size; ++i) {
- buffer << data[i];
- if (i + 1 != size) {
- buffer << ", ";
- }
- }
- }
- buffer << "]";
-
- return buffer.str();
-}
-
// Op mapping
std::map acclOp = {
{ReduceOp::SUM, ACCL::reduceFunction::SUM},
@@ -116,6 +153,7 @@ std::map acclDatatype = {
{at::kShort, ACCL::dataType::int32},
};
+
// Checking the input tensor's validity
void checkSingleTensorHelper(const at::Tensor &tensor) {
if (!tensor.is_contiguous()) {
@@ -197,6 +235,40 @@ const char *convert_datatype_to_torch(ACCL::dataType torch_type) {
}
}
+const char *string_of_accl_datatype(ACCL::dataType accl_type) {
+ switch (accl_type) {
+ case ACCL::dataType::float16:
+ return "ACCL::dataType::float16";
+ case ACCL::dataType::float32:
+ return "ACCL::dataType::float32";
+ case ACCL::dataType::float64:
+ return "ACCL::dataType::float64";
+ case ACCL::dataType::int32:
+ return "ACCL::dataType::int32";
+ case ACCL::dataType::int64:
+ return "ACCL::dataType::int64";
+ default:
+ return "unknown";
+ }
+}
+
+const char *string_of_torch_datatype(c10::ScalarType torch_type) {
+ switch (torch_type) {
+ case at::kHalf:
+ return "torch.float16";
+ case at::kFloat:
+ return "torch.float32";
+ case at::kDouble:
+ return "torch.float64";
+ case at::kInt:
+ return "torch.int32";
+ case at::kLong:
+ return "torch.int64";
+ default:
+ return "unknown";
+ }
+}
+
std::map convert_compression_from_dict(
const std::map &dictionary) {
std::map map;
@@ -221,140 +293,6 @@ std::map convert_compression_to_dict(
return dictionary;
}
-// Create an ACCL Buffer with correct type
-std::unique_ptr create_buffer(ACCL::ACCL &accl, size_t length,
- c10::ScalarType type) {
- switch (type) {
- case at::kInt:
- return accl.create_buffer(length, acclDatatype.at(type));
- case at::kLong:
- return accl.create_buffer(length, acclDatatype.at(type));
- case at::kFloat:
- return accl.create_buffer(length, acclDatatype.at(type));
- case at::kDouble:
- return accl.create_buffer(length, acclDatatype.at(type));
- default:
- TORCH_CHECK(false, "Tensor has unsupported datatype");
- break;
- }
-}
-
-// Create an ACCL Buffer with correct type
-std::unique_ptr create_coyotebuffer(ACCL::ACCL &accl, size_t length,
- c10::ScalarType type) {
- switch (type) {
- case at::kInt:
- return accl.create_coyotebuffer(length, acclDatatype.at(type));
- case at::kLong:
- return accl.create_coyotebuffer(length, acclDatatype.at(type));
- case at::kFloat:
- return accl.create_coyotebuffer(length, acclDatatype.at(type));
- case at::kDouble:
- return accl.create_coyotebuffer(length, acclDatatype.at(type));
- default:
- TORCH_CHECK(false, "Tensor has unsupported datatype");
- break;
- }
-}
-
-// Create an ACCL Buffer with correct type
-std::unique_ptr wrap_buffer(ACCL::ACCL &accl, xrt::bo &bo,
- size_t length,
- c10::ScalarType type) {
- size_t size;
- if (type == at::kInt || type == at::kFloat) {
- size = length * 4;
- } else {
- size = length * 8;
- }
- xrt::bo slice = xrt::bo(bo, size, static_cast(0));
- switch (type) {
- case at::kInt:
- return accl.create_buffer(slice, length, acclDatatype.at(type));
- case at::kLong:
- return accl.create_buffer(slice, length, acclDatatype.at(type));
- case at::kFloat:
- return accl.create_buffer(slice, length, acclDatatype.at(type));
- case at::kDouble:
- return accl.create_buffer(slice, length, acclDatatype.at(type));
- default:
- TORCH_CHECK(false, "Tensor has unsupported datatype");
- break;
- }
-}
-
-// Create an ACCL P2P Buffer with correct type
-std::unique_ptr
-create_buffer_p2p(ACCL::ACCL &accl, size_t length, c10::ScalarType type) {
- switch (type) {
- case at::kInt:
- return accl.create_buffer_p2p(length, acclDatatype.at(type));
- case at::kLong:
- return accl.create_buffer_p2p(length, acclDatatype.at(type));
- case at::kFloat:
- return accl.create_buffer_p2p(length, acclDatatype.at(type));
- case at::kDouble:
- return accl.create_buffer_p2p(length, acclDatatype.at(type));
- default:
- TORCH_CHECK(false, "Tensor has unsupported datatype");
- break;
- }
-}
-
-std::unique_ptr create_buffer_p2p(ACCL::ACCL &accl,
- const at::Tensor &tensor) {
- return create_buffer_p2p(accl, tensor.numel(), tensor.scalar_type());
-}
-
-// Create an ACCL Buffer with correct type from Tensor
-std::unique_ptr create_buffer(ACCL::ACCL &accl,
- const at::Tensor &tensor) {
- std::unique_ptr buffer;
- switch (tensor.scalar_type()) {
- case at::kInt:
- buffer = accl.create_buffer(static_cast(tensor.data_ptr()),
- tensor.numel(),
- acclDatatype.at(tensor.scalar_type()));
-
- ACCL::debug("Creating int32 buffer at 0x" +
- ACCL::debug_hex(buffer->address()) + " of " +
- std::to_string(buffer->size()) + "B.");
- break;
- case at::kLong:
- buffer = accl.create_buffer(static_cast(tensor.data_ptr()),
- tensor.numel(),
- acclDatatype.at(tensor.scalar_type()));
-
- ACCL::debug("Creating int64 buffer at 0x" +
- ACCL::debug_hex(buffer->address()) + " of " +
- std::to_string(buffer->size()) + "B.");
- break;
- case at::kFloat:
- buffer = accl.create_buffer(static_cast(tensor.data_ptr()),
- tensor.numel(),
- acclDatatype.at(tensor.scalar_type()));
-
- ACCL::debug("Creating float32 buffer at 0x" +
- ACCL::debug_hex(buffer->address()) + " of " +
- std::to_string(buffer->size()) + "B.");
-
- break;
- case at::kDouble:
- buffer = accl.create_buffer(static_cast(tensor.data_ptr()),
- tensor.numel(),
- acclDatatype.at(tensor.scalar_type()));
-
- ACCL::debug("Creating float64 buffer at 0x" +
- ACCL::debug_hex(buffer->address()) + " of " +
- std::to_string(buffer->size()) + "B.");
- break;
- default:
- TORCH_CHECK(false, "Tensor has unsupported datatype");
- break;
- }
-
- return buffer;
-}
// Check if process is compiled with HIP support
inline bool hip_enabled() {
@@ -374,149 +312,6 @@ inline bool cuda_enabled() {
#endif
}
-// Check if tensor is a GPU tensor, the ProcessGroup is compiled with GPU
-// support, ACCL is not running in simulation mode, and the ProcessGroup was
-// initialized with p2p_enabled
-bool p2p_applicable(ACCL::ACCL &accl, const at::Tensor &tensor,
- bool p2p_enabled) {
- auto type = tensor.device().type();
- if (type != c10::DeviceType::CPU && p2p_enabled && !accl.is_simulated()) {
- if (type == c10::DeviceType::HIP) {
- return hip_enabled();
- } else if (type == c10::DeviceType::CUDA) {
- // HIP tensors will identify themselves as CUDA tensor depending on the
- // initialization, so we have to see CUDA tensors as HIP tensors if
- // ProcessGroup is compiled with HIP support
-#ifdef ACCL_PROCESS_GROUP_HIP_ENABLED
- return hip_enabled();
-#else
- return cuda_enabled();
-#endif
- }
- }
- return false;
-}
-
-// Copy a GPU tensor to a P2P FPGA buffer
-void copy_to_p2p_buffer(ACCL::BaseBuffer &buffer, const at::Tensor &tensor) {
- if (tensor.device().type() == c10::DeviceType::HIP) {
- ACCL::debug("Syncing HIP GPU buffer to FPGA");
-#ifdef ACCL_PROCESS_GROUP_HIP_ENABLED
- hipMemcpy(buffer.byte_array(), tensor.data_ptr(), tensor.nbytes(),
- hipMemcpyDeviceToHost);
-#else
- TORCH_CHECK(false, "ACCL ProcessGroup is build without HIP support");
-#endif
- } else if (tensor.device().type() == c10::DeviceType::CUDA) {
-#ifdef ACCL_PROCESS_GROUP_HIP_ENABLED
- ACCL::debug("Syncing HIP GPU buffer to FPGA");
- hipMemcpy(buffer.byte_array(), tensor.data_ptr(), tensor.nbytes(),
- hipMemcpyDeviceToHost);
-#else
- ACCL::debug("Syncing CUDA GPU buffer to FPGA");
-#ifdef ACCL_PROCESS_GROUP_CUDA_ENABLED
- cudaMemcpy(buffer.byte_array(), tensor.data_ptr(), tensor.nbytes(),
- cudaMemcpyDeviceToHost);
-#else
- TORCH_CHECK(false, "ACCL ProcessGroup is build without CUDA support");
-#endif // ACCL_PROCESS_GROUP_CUDA_ENABLED
-#endif // ACCL_PROCESS_GROUP_HIP_ENABLED
- }
-}
-
-// Create a new FPGA P2P buffer and copy contents of GPU tensor
-inline std::unique_ptr
-create_and_copy_p2p_buffer(ACCL::ACCL &accl, const at::Tensor &tensor) {
- ACCL::debug("Creating p2p buffer of size " + std::to_string(tensor.nbytes()));
- std::unique_ptr buffer =
- create_buffer_p2p(accl, tensor.numel(), tensor.scalar_type());
- copy_to_p2p_buffer(*buffer, tensor);
- return buffer;
-}
-
-// Copy results from an FPGA P2P buffer back to the GPU tensor
-void copy_back_p2p_buffer(ACCL::BaseBuffer &buffer, const at::Tensor &tensor) {
- if (tensor.device().type() == c10::DeviceType::HIP) {
- ACCL::debug("Syncing HIP GPU buffer from FPGA");
-#ifdef ACCL_PROCESS_GROUP_HIP_ENABLED
- hipMemcpy(tensor.data_ptr(), buffer.byte_array(), tensor.nbytes(),
- hipMemcpyHostToDevice);
-#else
- TORCH_CHECK(false, "ACCL ProcessGroup is build without HIP support");
-#endif
- } else if (tensor.device().type() == c10::DeviceType::CUDA) {
-#ifdef ACCL_PROCESS_GROUP_HIP_ENABLED
- ACCL::debug("Syncing HIP GPU buffer from FPGA");
- hipMemcpy(tensor.data_ptr(), buffer.byte_array(), tensor.nbytes(),
- hipMemcpyHostToDevice);
-#else
- ACCL::debug("Syncing CUDA GPU buffer from FPGA");
-#ifdef ACCL_PROCESS_GROUP_CUDA_ENABLED
- cudaMemcpy(tensor.data_ptr(), buffer.byte_array(), tensor.nbytes(),
- cudaMemcpyHostToDevice);
-#else
- TORCH_CHECK(false, "ACCL ProcessGroup is build without CUDA support");
-#endif // ACCL_PROCESS_GROUP_CUDA_ENABLED
-#endif // ACCL_PROCESS_GROUP_HIP_ENABLED
- }
-}
-
-bool check_arp(vnx::Networklayer &network_layer,
- std::vector &ranks, int rank, int size) {
- std::map ranks_checked;
- for (unsigned i = 0; i < static_cast(size); ++i) {
- ranks_checked[i] = false;
- }
-
- bool sanity_check = true;
- const std::map> arp =
- network_layer.read_arp_table(size);
-
- std::ostringstream ss_arp;
- ss_arp << "ARP table:";
-
- for (const std::pair> &elem :
- arp) {
- const unsigned index = elem.first;
- const std::pair &entry = elem.second;
- const std::string &mac = entry.first;
- const std::string &ip = entry.second;
- ss_arp << "\n(" << index << ") " << mac << ": " << ip;
-
- for (unsigned i = 0; i < static_cast(size); ++i) {
- if (ranks[i].ip == ip) {
- if (ranks_checked[i]) {
- std::cerr << "Double entry for " << ip << " in arp table!"
- << std::endl;
- sanity_check = false;
- } else {
- ranks_checked[i] = true;
- }
- }
- }
- }
-
- ACCL::debug(ss_arp.str());
-
- if (!sanity_check) {
- return false;
- }
-
- unsigned hosts = 0;
- for (unsigned i = 0; i < static_cast(size); ++i) {
- if (ranks_checked[i]) {
- hosts += 1;
- }
- }
- if (hosts < static_cast(size) - 1) {
- std::cerr << "Found only " << hosts << " hosts out of " << size - 1 << "!"
- << std::endl;
- return false;
- }
-
- return true;
-}
-
} // namespace
ACCL::dataType ProcessGroupACCL::get_compressed_type(c10::ScalarType datatype) {
@@ -580,6 +375,66 @@ std::vector convert_ranks(
return accl_ranks;
}
+// just for the sa_handler
+std::unique_ptr<::ACCL::ACCL>* global_accl;
+
+void accl_sa_handler(int)
+{
+ static bool once = true;
+ if(once) {
+ global_accl->reset();
+ // std::cout << "Error! Signal received. Finalizing MPI..." << std::endl;
+ // MPI_Finalize();
+ // std::cout << "Done. Terminating..." << std::endl;
+ once = false;
+ }
+ exit(EXIT_FAILURE);
+}
+
+void ProcessGroupACCL::init_input_tensor(at::Tensor &tensor, std::unique_ptr &data, bool do_on_root, bool do_on_others, int opts_root_rank) {
+ if DO_COND {
+ std::memcpy(data->byte_array(), tensor.data_ptr(), tensor.numel() * tensor.element_size());
+ if (!coyote_enabled) {
+ data->sync_to_device();
+ }
+ }
+}
+
+void ProcessGroupACCL::init_input_data_vec(std::vector &tensor_vec, std::unique_ptr &data, const at::TensorOptions &options, bool do_on_root, bool do_on_others, int opts_root_rank) {
+ if DO_COND {
+ int64_t tens_size = static_cast(tensor_vec[0].numel());
+ int64_t total_size = tens_size * static_cast(size_);
+
+ for (const auto i : c10::irange(tensor_vec.size())) {
+ std::memcpy(data->byte_array() + i * tens_size * tensor_vec[0].element_size(), tensor_vec[i].data_ptr(), tens_size * tensor_vec[0].element_size());
+ }
+ if (!coyote_enabled) {
+ data->sync_to_device();
+ }
+ }
+}
+
+void ProcessGroupACCL::copy_back_tensor(at::Tensor tensor_original, std::unique_ptr &data, bool do_on_root, bool do_on_others, int opts_root_rank){
+ if DO_COND {
+ if (!coyote_enabled) {
+ data->sync_from_device();
+ }
+ std::memcpy(tensor_original.data_ptr(), data->byte_array(), tensor_original.numel() * tensor_original.element_size());
+ }
+}
+
+void ProcessGroupACCL::copy_back_tensorvec(const std::vector &dsttensorvec, std::unique_ptr &data, at::Tensor &dsttensor, int numel, int offset, bool do_on_root, bool do_on_others, int opts_root_rank){
+ if DO_COND {
+ if (!coyote_enabled) {
+ data->sync_from_device();
+ }
+ for (const auto i : c10::irange(dsttensorvec.size())) {
+ std::memcpy(dsttensorvec[i].data_ptr(), data->byte_array() + i * offset * dsttensor.element_size(), numel * dsttensor.element_size());
+ }
+ }
+}
+
+
// Initialize ACCL
ProcessGroupACCL::ProcessGroupACCL(
const c10::intrusive_ptr<::c10d::Store> &store, int rank, int size,
@@ -598,83 +453,87 @@ ProcessGroupACCL::ProcessGroupACCL(
|| design == accl_network_utils::acclDesign::CYT_TCP),
compression(compression), initialized(false) {
+ ACCL::debug("Process Group constructor called");
+
+ struct sigaction sa;
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_handler = accl_sa_handler;
+ sigfillset(&sa.sa_mask);
+ sigaction(SIGINT,&sa,NULL);
+ sigaction(SIGSEGV, &sa, NULL);
+
if (std::find(profiling_ranks.begin(), profiling_ranks.end(), rank) !=
profiling_ranks.end()) {
std::this_thread::sleep_for(
std::chrono::duration(profiling_timeout));
}
-
+
+ ACCL::debug("Converting ranks");
ranks_ = convert_ranks(ranks);
-
- if (coyote_enabled) {
- if (design == accl_network_utils::acclDesign::CYT_TCP) {
- cyt_device = new ACCL::CoyoteDevice();
- } else if (design == accl_network_utils::acclDesign::CYT_RDMA) {
- cyt_device = new ACCL::CoyoteDevice(size_);
- cyt::setup_cyt_rdma(ibvQpConn_vec, ranks_, rank_, *cyt_device);
- } else {
- throw std::runtime_error("Undefined ACCL design");
+ design_ = design;
+ MPI_Barrier(MPI_COMM_WORLD);
+ if (!simulator){
+ if (coyote_enabled) {
+ if (design_ == accl_network_utils::acclDesign::CYT_TCP) {
+ cyt_device = new ACCL::CoyoteDevice();
+ accl_network_utils::configure_cyt_tcp(ranks_, rank_, cyt_device);
+ } else if (design_ == accl_network_utils::acclDesign::CYT_RDMA) {
+ ACCL::debug("Creating CoyoteDevice");
+ cyt_device = new ACCL::CoyoteDevice(size_);
+ accl_network_utils::configure_cyt_rdma(ranks_, rank_, cyt_device);
+ } else {
+ throw std::runtime_error("Undefined ACCL design");
+ }
+ }
+ else{
+ xrt_device = xrt::device(device_index);
}
}
}
-std::vector ProcessGroupACCL::get_local_qp(unsigned int rank) {
- std::vector qp;
- char *data = (char *) &ibvQpConn_vec[rank]->getQpairStruct()->local;
- for (std::size_t i = 0; i < sizeof(fpga::ibvQ); ++i) {
- qp.push_back(data[i]);
- }
-
- return qp;
-}
-
-void ProcessGroupACCL::set_remote_qp(unsigned int rank, std::vector &qp) {
- fpga::ibvQ remote_qp;
- char *data = (char *) &remote_qp;
- for (std::size_t i = 0; i < sizeof(fpga::ibvQ); ++i) {
- data[i] = qp[i];
- }
-
- ibvQpConn_vec[rank]->getQpairStruct()->remote = remote_qp;
-}
-
void ProcessGroupACCL::initialize() {
- xrt::device device;
+ std::cout << "PG initialize called\n";
if (initialized) {
throw std::runtime_error("Already initialized process group");
}
- if (coyote_enabled) {
- if (design_ == accl_network_utils::acclDesign::CYT_RDMA) {
- cyt::configure_cyt_rdma(ibvQpConn_vec, ranks_, rank_);
- } else {
- throw std::runtime_error("Coyote configure not implemented");
- }
+ if (coyote_enabled && !simulator_) {
- accl = std::make_unique(cyt_device, ranks_, rank_, size_ + 2,
- bufsize, bufsize, 8388608UL);
+ accl = std::make_unique(cyt_device);
+ global_accl = &accl;
+
+ // Rendezvous protocol for now
+ int segsize = 4096 * 1024;
+
+
+ accl.get()->initialize(ranks_, rank_, 16, 1024, RDVZ_THRESHOLD, 4096*1024);
+
ACCL::debug(std::string("[ACCL coyote] communicator: ") + accl->dump_communicator());
- } else {
- if (!simulator_) {
- device = xrt::device(device_index_);
- }
+
+
+ } else {
+ // ACCL::debug(std::string("Error XRT initialization deprecated"));
accl = accl_network_utils::initialize_accl(ranks_, rank_,
- simulator_, design_, device,
+ simulator_, design_, xrt_device,
xclbin_, nbufs_, bufsize, 0,
rsfec_);
+ ACCL::debug(std::string("Setting timeout and Threshold"));
+ accl->set_timeout(1e6);
+ // accl->set_rendezvous_threshold(16*1024);
+
int devicemem = accl->devicemem();
- if (!simulator_) {
- // Initialize cache buffers
- buf0 = xrt::bo(device, bufsize, devicemem);
- buf1 = xrt::bo(device, bufsize, devicemem);
- }
+
}
+ in_buf = accl->create_buffer_host(bufsize/sizeof(float), ACCL::dataType::float32);
+ out_buf = accl->create_buffer_host(bufsize/sizeof(float), ACCL::dataType::float32);
+
accl->set_timeout(1e8);
// Start the worker thread accepting ACCL calls
workerThread_ = std::thread(&ProcessGroupACCL::runLoop, this);
initialized = true;
+ ACCL::debug(std::string("Finished Initialization"));
}
ProcessGroupACCL::~ProcessGroupACCL() { destroy(); }
@@ -683,6 +542,12 @@ void ProcessGroupACCL::destroy() {
std::unique_lock lock(pgMutex_);
queueConsumeCV_.wait(lock, [&] { return queue_.empty(); });
+ // TODO free other buffer types
+ if (!simulator_) {
+ in_buf->free_buffer();
+ out_buf->free_buffer();
+ }
+
// Queue is empty, signal stop
stop_ = true;
@@ -746,79 +611,59 @@ c10::intrusive_ptr ProcessGroupACCL::enqueue(
queueProduceCV_.notify_one();
return work;
}
-
-void ProcessGroupACCL::run_broadcast(at::Tensor tensor_original,
+#undef COLL_NAME
+#define COLL_NAME Broadcast
+void ProcessGroupACCL::run_broadcast(at::Tensor in_tensor,
const BroadcastOptions &opts) {
- at::Tensor *tensor = &tensor_original;
- at::Tensor empty_tensor;
- std::unique_ptr data;
+
+ std::chrono::time_point start_inner = std::chrono::high_resolution_clock::now();
+
+ // This is very experimental
+ #ifdef SIDESTEP_BCAST_WITH_ALLREDUCE
+ // It seems to have issues with non-even counts, so the count is rounded up (see ROUND_NR)
+ int rounded_count = (in_tensor.numel() + ROUND_NR) & ~ROUND_NR;
+
+ int imaginary_count = rounded_count;
+ if (in_tensor.scalar_type() == at::kDouble || in_tensor.scalar_type() == at::kLong){
+ imaginary_count = (in_tensor.numel()*2 + ROUND_NR) & ~ROUND_NR;
+ }
+
+ auto zero_tensor = torch::zeros({imaginary_count}, at::kInt);
+ if (opts.rootRank == rank_){
+ init_input_tensor(in_tensor, in_buf, true, false, opts.rootRank);
+ }
+ else{
+ init_input_tensor(zero_tensor, in_buf, false, true, opts.rootRank);
+ }
+ init_input_tensor(zero_tensor, out_buf, true, false, opts.rootRank);
// Reserve device
- c10::DeviceGuard guard(tensor->device());
+ c10::DeviceGuard guard(in_tensor.device());
std::unique_lock globalLock(pgGlobalMutex_);
- // Copy data from GPU to FPGA if necessary
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- data = create_and_copy_p2p_buffer(*accl, tensor_original);
- } else {
- if (coyote_enabled) {
- // Copy tensor to CPU tensor first
- data = create_coyotebuffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- if (rank_ == opts.rootRank) {
- tensor->copy_(tensor_original);
- }
- } else if (tensor_original.device().type() != c10::DeviceType::CPU) {
- // Copy tensor to CPU tensor first
- data = create_buffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- if (rank_ == opts.rootRank) {
- tensor->copy_(tensor_original);
- }
- } else {
- data = create_buffer(*accl, *tensor);
- }
- }
+ PRE_REQUEST(Broadcast, in_tensor);
+
+ auto req = accl->allreduce(*in_buf, *out_buf, imaginary_count, ACCL::reduceFunction::SUM);
- // Run broadcast
- ACCL::debug("Starting broadcast of " + std::to_string(tensor->numel()) +
- " items");
+ copy_back_tensor(in_tensor, out_buf, true, true);
+
+ #else
- if (!coyote_enabled && rank_ == opts.rootRank) {
- data->sync_to_device();
+ int rounded_count = (in_tensor.numel() + ROUND_NR) & ~ROUND_NR;
+
+ if (opts.rootRank == rank_){
+ init_input_tensor(in_tensor, in_buf, true, false, opts.rootRank);
}
- accl->bcast(*data, tensor->numel(), opts.rootRank, ACCL::GLOBAL_COMM, true,
- true, get_compressed_type(tensor->scalar_type()));
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
+ c10::DeviceGuard guard(in_tensor.device());
+ std::unique_lock globalLock(pgGlobalMutex_);
- if (!coyote_enabled && rank_ != opts.rootRank) {
- data->sync_from_device();
- }
+ PRE_REQUEST(Broadcast, in_tensor);
+
+ auto req = accl->bcast(*in_buf, rounded_count, opts.rootRank);
- // Copy results back to GPU if necessary
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- copy_back_p2p_buffer(*data, tensor_original);
- } else if (coyote_enabled || tensor_original.device().type() != c10::DeviceType::CPU) {
- ACCL::debug("Copying data back from CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- if (rank_ != opts.rootRank) {
- tensor_original.copy_(*tensor);
- }
- }
+ copy_back_tensor(in_tensor, in_buf, false, true, opts.rootRank);
+ #endif
}
c10::intrusive_ptr
@@ -827,17 +672,34 @@ ProcessGroupACCL::broadcast(std::vector &tensors,
checkSingleTensor(tensors);
std::function &)> runFunc =
[opts, this](std::unique_ptr &entry) {
- at::Tensor &tensor = (entry->src)[0];
+ if (BROADCAST_SIDESTEP){
+
+ auto data = (entry->src)[0];
+ ACCL::debug("[Broadcast] -- Sidestepped using OpenMPI -- size " + std::to_string(data.numel()));
+ c10::DeviceGuard guard(data.device());
+ std::unique_lock globalLock(pgGlobalMutex_);
+ MPI_CHECK(MPI_Bcast(
+ data.data_ptr(),
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ opts.rootRank,
+ MPI_COMM_WORLD));
+ } else {
+ at::Tensor &tensor = (entry->src)[0];
// Segment data if necessary
- if (tensor.nbytes() > bufsize) {
- size_t n = bufsize / tensor.itemsize();
- for (size_t i = 0; i < tensor.numel(); i += n) {
- size_t end = std::min(i + n, static_cast(tensor.numel()));
- run_broadcast(tensor.slice(0, i, end), opts);
+ if (tensor.nbytes() > ACCL_MSG_SIZE) {
+ size_t non_zero_dim_count = tensor.numel() / tensor.size(0);
+ size_t n = ACCL_MSG_SIZE / tensor.itemsize() / non_zero_dim_count;
+ ACCL::debug("[Broadcast] Segmenting tensor of size " + std::to_string(tensor.nbytes()) + " into " + std::to_string(n * non_zero_dim_count) + "-sized elements ");
+ for (size_t i = 0; i < tensor.size(0); i += n) {
+ ACCL::debug("part " + std::to_string(i) + "!");
+ size_t end = std::min(n, static_cast(tensor.size(0)) - i);
+ run_broadcast(tensor.narrow(0, i, end), opts);
}
} else {
run_broadcast(tensor, opts);
}
+ }
};
auto entry =
std::make_unique(&tensors, &tensors, std::move(runFunc));
@@ -845,72 +707,23 @@ ProcessGroupACCL::broadcast(std::vector &tensors,
c10::optional>(tensors));
}
-void ProcessGroupACCL::run_allreduce(at::Tensor tensor_original,
+#undef COLL_NAME
+#define COLL_NAME Allreduce
+
+void ProcessGroupACCL::run_allreduce(at::Tensor in_tensor,
const AllreduceOptions &opts) {
- at::Tensor *tensor = &tensor_original;
- at::Tensor empty_tensor;
- std::unique_ptr data;
- std::unique_ptr result;
+
+ init_input_tensor(in_tensor, in_buf, true, true);
// Reserve device
- c10::DeviceGuard guard(tensor->device());
+ c10::DeviceGuard guard(in_tensor.device());
std::unique_lock globalLock(pgGlobalMutex_);
+ PRE_REQUEST(Allreduce,in_tensor);
+ int rounded_count = (in_tensor.numel() + ROUND_NR) & ~ROUND_NR;
+
+ auto req = accl->allreduce(*in_buf, *out_buf, rounded_count, acclOp.at(opts.reduceOp));
- // Copy data from GPU to FPGA if necessary, and create a new result buffer,
- // since ACCL doesn't support in-place allreduce
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- data = create_and_copy_p2p_buffer(*accl, tensor_original);
- result = create_buffer_p2p(*accl, tensor->numel(), tensor->scalar_type());
- } else {
- if (accl->is_simulated() || coyote_enabled) {
- data = create_buffer(*accl, tensor->numel(), tensor->scalar_type());
- } else {
- data = wrap_buffer(*accl, buf0, tensor->numel(), tensor->scalar_type());
- }
- ACCL::debug("Copying data to aligned CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- tensor->copy_(tensor_original);
- ACCL::debug("Creating extra result buffer of size " +
- std::to_string(tensor_original.numel()));
- if (accl->is_simulated() || coyote_enabled) {
- result = create_buffer(*accl, tensor->numel(), tensor->scalar_type());
- } else {
- result = wrap_buffer(*accl, buf1, tensor->numel(), tensor->scalar_type());
- }
- }
-
- // Run allreduce
- ACCL::debug("Starting allreduce of " + std::to_string(tensor->numel()) +
- " items");
- if (!coyote_enabled) {
- data->sync_to_device();
- }
- accl->allreduce(*data, *result, tensor->numel(), acclOp.at(opts.reduceOp),
- ACCL::GLOBAL_COMM, true, true,
- get_compressed_type(tensor->scalar_type()));
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
-
- if (!coyote_enabled) {
- result->sync_from_device();
- }
-
- // Copy result buffer back to original tensor
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- copy_back_p2p_buffer(*result, tensor_original);
- } else {
- ACCL::debug("Copying result data back to original tensor of size " +
- std::to_string(tensor_original.numel()));
- tensor_original.copy_(torch::from_blob(
- result->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU)));
- }
+ copy_back_tensor(in_tensor, out_buf, true, true);
}
c10::intrusive_ptr
@@ -920,17 +733,34 @@ ProcessGroupACCL::allreduce(std::vector &tensors,
std::function &)> runFunc =
[opts, this](std::unique_ptr &entry) {
- auto tensor = (entry->src)[0];
- // Segment data if necessary
- if (tensor.nbytes() > bufsize) {
- size_t n = bufsize / tensor.itemsize();
- for (size_t i = 0; i < tensor.numel(); i += n) {
- size_t end = std::min(i + n, static_cast(tensor.numel()));
- run_allreduce(tensor.slice(0, i, end), opts);
- }
- } else {
- run_allreduce(tensor, opts);
- }
+ if (ALLREDUCE_SIDESTEP){
+ auto data = (entry->src)[0];
+ ACCL::debug("[Allreduce] -- Sidestepped using OpenMPI -- size " + std::to_string(data.numel()));
+ c10::DeviceGuard guard(data.device());
+ std::unique_lock globalLock(pgGlobalMutex_);
+ MPI_CHECK(MPI_Allreduce(
+ MPI_IN_PLACE,
+ data.data_ptr(),
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ mpiOp.at(opts.reduceOp),
+ MPI_COMM_WORLD));
+ } else {
+ auto tensor = (entry->src)[0];
+ // Segment data if necessary
+ if (tensor.nbytes() > (ACCL_MSG_SIZE)) {
+ size_t non_zero_dim_count = tensor.numel() / tensor.size(0);
+ size_t n = ACCL_MSG_SIZE / (tensor.itemsize() * non_zero_dim_count);
+ ACCL::debug("[Allreduce] Segmenting tensor of size " + std::to_string(tensor.nbytes()) + " into " + std::to_string(n * non_zero_dim_count) + "-sized elements ");
+ for (size_t i = 0; i < tensor.size(0); i += n) {
+ // ACCL::debug("part " + std::to_string(i) + "!");
+ size_t end = std::min(n, static_cast(tensor.size(0)) - i);
+ run_allreduce(tensor.narrow(0, i, end), opts);
+ }
+ } else {
+ run_allreduce(tensor, opts);
+ }
+ }
};
auto entry =
std::make_unique(&tensors, &tensors, std::move(runFunc));
@@ -944,91 +774,25 @@ ProcessGroupACCL::allreduce_coalesced(std::vector &tensors,
TORCH_CHECK(false,
"allreduce_coalesced is currently not supported with ACCL");
}
-
-void ProcessGroupACCL::run_reduce(at::Tensor tensor_original,
+#undef COLL_NAME
+#define COLL_NAME Reduce
+void ProcessGroupACCL::run_reduce(at::Tensor in_tensor,
const ReduceOptions &opts) {
- at::Tensor *tensor = &tensor_original;
- at::Tensor empty_tensor;
- std::unique_ptr data;
- std::unique_ptr result;
- // Reserve device
- c10::DeviceGuard guard(tensor->device());
- std::unique_lock globalLock(pgGlobalMutex_);
- // Copy data from GPU to FPGA if necessary, and create a new result buffer,
- // since ACCL doesn't support in-place reduce
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- data = create_and_copy_p2p_buffer(*accl, tensor_original);
+ init_input_tensor(in_tensor, in_buf, true, true);
- if (rank_ == opts.rootRank) {
- result = create_buffer_p2p(*accl, tensor->numel(), tensor->scalar_type());
- }
- } else {
- if (coyote_enabled) {
- // Copy tensor to CPU tensor first
- data = create_coyotebuffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- tensor->copy_(tensor_original);
- } else if (tensor_original.device().type() != c10::DeviceType::CPU) {
- data = create_buffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- tensor->copy_(tensor_original);
- } else {
- data = create_buffer(*accl, *tensor);
- }
+ // Reserve device
+ c10::DeviceGuard guard(in_tensor.device());
+ std::unique_lock globalLock(pgGlobalMutex_);
- if (rank_ == opts.rootRank) {
- ACCL::debug("Creating extra result buffer of size " +
- std::to_string(tensor_original.numel()));
- if (coyote_enabled) {
- result = create_coyotebuffer(*accl, tensor->numel(), tensor->scalar_type());
- } else {
- result = create_buffer(*accl, tensor->numel(), tensor->scalar_type());
- }
- }
- }
+ PRE_REQUEST(Reduce,in_tensor);
- // Run reduce
- ACCL::debug("Starting reduce of " + std::to_string(tensor->numel()) +
- " items");
- if (!coyote_enabled) {
- data->sync_to_device();
- }
- accl->reduce(*data, *result, tensor->numel(), opts.rootRank,
- acclOp.at(opts.reduceOp), ACCL::GLOBAL_COMM, true, true,
- get_compressed_type(tensor->scalar_type()));
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
+ int rounded_count = (in_tensor.numel() + ROUND_NR) & ~ROUND_NR;
- if (!coyote_enabled && rank_ == opts.rootRank) {
- result->sync_from_device();
- }
+ auto req = accl->reduce(*in_buf, *out_buf, rounded_count, opts.rootRank, acclOp.at(opts.reduceOp));
- // Copy result buffer back to original tensor
- if (rank_ == opts.rootRank) {
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- copy_back_p2p_buffer(*result, tensor_original);
- } else {
- ACCL::debug("Copying back results to original tensor of size " +
- std::to_string(tensor_original.numel()));
- tensor_original.copy_(torch::from_blob(
- result->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU)));
- }
- }
+ copy_back_tensor(in_tensor, out_buf, true, false, opts.rootRank);
}
c10::intrusive_ptr
@@ -1043,8 +807,8 @@ ProcessGroupACCL::reduce(std::vector &tensors,
if (tensor.nbytes() > bufsize) {
size_t n = bufsize / tensor.itemsize();
for (size_t i = 0; i < tensor.numel(); i += n) {
- size_t end = std::min(i + n, static_cast(tensor.numel()));
- run_reduce(tensor.slice(0, i, end), opts);
+      size_t end = std::min(n, static_cast<size_t>(tensor.numel()) - i);
+ run_reduce(tensor.narrow(0, i, end), opts);
}
} else {
run_reduce(tensor, opts);
@@ -1056,101 +820,24 @@ ProcessGroupACCL::reduce(std::vector &tensors,
      c10::optional<std::vector<at::Tensor>>(tensors));
}
+#undef COLL_NAME
+#define COLL_NAME Allgather
void ProcessGroupACCL::run_allgather(
- at::Tensor srctensor_original,
+ at::Tensor in_tensor,
    const std::vector<at::Tensor> &dsttensorvec) {
- at::Tensor *srctensor = &srctensor_original;
- at::Tensor empty_srctensor;
- std::unique_ptr srcdata;
- at::Tensor dsttensor;
- std::unique_ptr dstdata;
-
- // Reserve device
- c10::DeviceGuard guard(srctensor->device());
+
+ init_input_tensor(in_tensor, in_buf, true, true);
+ c10::DeviceGuard guard(in_tensor.device());
  std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
- // Copy data from GPU to FPGA if necessary
- if (p2p_applicable(*accl, srctensor_original, p2p_enabled)) {
- srcdata = create_and_copy_p2p_buffer(*accl, srctensor_original);
- } else {
- if (coyote_enabled) {
- // Copy tensor to CPU tensor first
- srcdata = create_coyotebuffer(*accl, srctensor->numel(), srctensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(srctensor_original.numel()));
- empty_srctensor = torch::from_blob(
- srcdata->byte_array(), srctensor_original.sizes(),
- srctensor_original.options().device(c10::DeviceType::CPU));
- srctensor = &empty_srctensor;
- srctensor->copy_(srctensor_original);
- } else if (srctensor_original.device().type() != c10::DeviceType::CPU) {
- srcdata =
- create_buffer(*accl, srctensor->numel(), srctensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(srctensor_original.numel()));
- empty_srctensor = torch::from_blob(
- srcdata->byte_array(), srctensor_original.sizes(),
- srctensor_original.options().device(c10::DeviceType::CPU));
- srctensor = &empty_srctensor;
- srctensor->copy_(srctensor_original);
- } else {
- srcdata = create_buffer(*accl, *srctensor);
- }
- }
-
- // Create new output tensor, since dsttensorvec is not continuous in memory
- if (p2p_applicable(*accl, dsttensorvec[0], p2p_enabled)) {
- dstdata = create_buffer_p2p(*accl,
- srctensor->numel() * static_cast(size_),
- srctensor->scalar_type());
- } else if (coyote_enabled) {
- dstdata =
- create_coyotebuffer(*accl, srctensor->numel() * static_cast(size_),
- srctensor->scalar_type());
- std::vector sizes = {static_cast(srctensor->numel()) *
- size_};
- dsttensor = torch::from_blob(
- dstdata->byte_array(), sizes,
- srctensor_original.options().device(c10::DeviceType::CPU));
- } else {
- dstdata =
- create_buffer(*accl, srctensor->numel() * static_cast(size_),
- srctensor->scalar_type());
- std::vector sizes = {static_cast(srctensor->numel()) *
- size_};
- dsttensor = torch::from_blob(
- dstdata->byte_array(), sizes,
- srctensor_original.options().device(c10::DeviceType::CPU));
- }
-
- // Run allgather
- ACCL::debug("Starting allgather of " + std::to_string(srctensor->numel()) +
- " items");
- if (!coyote_enabled) {
- srcdata->sync_to_device();
- }
- accl->allgather(*srcdata, *dstdata, srctensor->numel(), ACCL::GLOBAL_COMM,
- true, true, get_compressed_type(srctensor->scalar_type()));
+
+ PRE_REQUEST(Allgather,in_tensor);
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
- if (!coyote_enabled) {
- dstdata->sync_from_device();
- }
+ int rounded_count = (in_tensor.numel() + 1023) & ~1023;
+
+ auto req = accl->allgather(*in_buf, *out_buf, rounded_count);
- // Copy results back to dsttensorvec
- for (const auto i : c10::irange(dsttensorvec.size())) {
- if (p2p_applicable(*accl, dsttensorvec[0], p2p_enabled)) {
- auto slice =
- dstdata->slice(i * srctensor->numel(), (i + 1) * srctensor->numel());
- copy_back_p2p_buffer(*slice, dsttensorvec[i]);
- } else {
- dsttensorvec[i].copy_(dsttensor.slice(0, i * srctensor->numel(),
- (i + 1) * srctensor->numel()));
- }
- }
+ copy_back_tensorvec(dsttensorvec, out_buf, in_tensor, in_tensor.numel(), rounded_count, true, true);
}
c10::intrusive_ptr
@@ -1171,24 +858,48 @@ ProcessGroupACCL::allgather(std::vector> &outputTensors,
      std::function<void(std::unique_ptr<WorkEntry> &)> runFunc =
          [this](std::unique_ptr<WorkEntry> &entry) {
+ ACCL::debug("Starting AllGather");
+ #ifdef ALLGATHER_SIDESTEP
+ ACCL::debug("[AllGather] -- Sidestepped using OpenMPI --");
+ auto data = (entry->src)[0];
+            std::vector<at::Tensor> outputDataVec = entry->dst;
+ auto flatOutputTensor = newLikeFlat(outputDataVec);
+
+ c10::DeviceGuard guard(data.device());
+            std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
+ MPI_CHECK(MPI_Allgather(
+ data.data_ptr(),
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ flatOutputTensor.data_ptr(),
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ MPI_COMM_WORLD));
+
+ for (const auto i : c10::irange(outputDataVec.size())) {
+ outputDataVec[i].copy_(flatOutputTensor[i]);
+ }
+ #else
auto srctensor = (entry->src)[0];
auto &dsttensors = entry->dst;
// Segment data if necessary
if (srctensor.nbytes() > bufsize) {
- size_t n = bufsize / srctensor.itemsize();
- for (size_t i = 0; i < srctensor.numel(); i += n) {
+ size_t non_zero_dim_count = srctensor.numel() / srctensor.size(0);
+ size_t n = bufsize / srctensor.itemsize() / non_zero_dim_count;
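+              // n is the number of dim-0 rows per chunk: each row carries non_zero_dim_count
+              // elements, so n rows fill at most one staging buffer of bufsize bytes.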
+ for (size_t i = 0; i < srctensor.size(0); i += n) {
size_t end =
- std::min(i + n, static_cast(srctensor.numel()));
+                std::min(n, static_cast<size_t>(srctensor.size(0) - i));
            std::vector<at::Tensor> dsttensorslices;
dsttensorslices.reserve(dsttensors.size());
for (auto &dsttensor : dsttensors) {
- dsttensorslices.emplace_back(dsttensor.slice(0, i, end));
+ dsttensorslices.emplace_back(dsttensor.narrow(0, i, end));
}
- run_allgather(srctensor.slice(0, i, end), dsttensorslices);
+ run_allgather(srctensor.narrow(0, i, end), dsttensorslices);
}
} else {
run_allgather(srctensor, dsttensors);
}
+ #endif
};
  auto entry = std::make_unique<WorkEntry>(&inputTensors, &outputTensors[0],
std::move(runFunc));
@@ -1203,109 +914,23 @@ c10::intrusive_ptr ProcessGroupACCL::allgather_coalesced(
TORCH_CHECK(false, "ProcessGroupACCL does not support allgather_coalesced");
}
-void ProcessGroupACCL::run_gather(at::Tensor srctensor_original,
+#undef COLL_NAME
+#define COLL_NAME Gather
+
+void ProcessGroupACCL::run_gather(at::Tensor in_tensor,
                                  const std::vector<at::Tensor> &dsttensorvec,
const GatherOptions &opts) {
- at::Tensor *srctensor = &srctensor_original;
- at::Tensor empty_srctensor;
- std::unique_ptr srcdata;
- at::Tensor dsttensor;
- std::unique_ptr dstdata;
-
// Reserve device
- c10::DeviceGuard guard(srctensor->device());
+ c10::DeviceGuard guard(in_tensor.device());
  std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
- // Copy data from GPU to FPGA if necessary
- if (p2p_applicable(*accl, srctensor_original, p2p_enabled)) {
- srcdata = create_and_copy_p2p_buffer(*accl, srctensor_original);
- } else {
- if (coyote_enabled) {
- srcdata =
- create_coyotebuffer(*accl, srctensor->numel(), srctensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(srctensor_original.numel()));
- empty_srctensor = torch::from_blob(
- srcdata->byte_array(), srctensor_original.sizes(),
- srctensor_original.options().device(c10::DeviceType::CPU));
- srctensor = &empty_srctensor;
- srctensor->copy_(srctensor_original);
- } else if (srctensor_original.device().type() != c10::DeviceType::CPU) {
- srcdata =
- create_buffer(*accl, srctensor->numel(), srctensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(srctensor_original.numel()));
- empty_srctensor = torch::from_blob(
- srcdata->byte_array(), srctensor_original.sizes(),
- srctensor_original.options().device(c10::DeviceType::CPU));
- srctensor = &empty_srctensor;
- srctensor->copy_(srctensor_original);
- } else {
- srcdata = create_buffer(*accl, *srctensor);
- }
- }
-
- // Create new output tensor, since dsttensorvec is not continuous in memory
- if (rank_ == opts.rootRank) {
- if (p2p_applicable(*accl, dsttensorvec[0], p2p_enabled)) {
- dstdata = create_buffer_p2p(
- *accl, srctensor->numel() * static_cast(size_),
- srctensor->scalar_type());
- } else if (coyote_enabled) {
- dstdata =
- create_coyotebuffer(*accl, srctensor->numel() * static_cast(size_),
- srctensor->scalar_type());
- std::vector sizes = {static_cast(srctensor->numel()) *
- size_};
- dsttensor =
- torch::from_blob(dstdata->byte_array(), sizes,
- srctensor->options().device(c10::DeviceType::CPU));
- } else {
- dstdata =
- create_buffer(*accl, srctensor->numel() * static_cast(size_),
- srctensor->scalar_type());
- std::vector sizes = {static_cast(srctensor->numel()) *
- size_};
- dsttensor =
- torch::from_blob(dstdata->byte_array(), sizes,
- srctensor->options().device(c10::DeviceType::CPU));
- }
- }
-
- // Run gather
- ACCL::debug("Starting gather of " + std::to_string(srctensor->numel()) +
- " items");
+ init_input_tensor(in_tensor, in_buf, true, true);
- if (!coyote_enabled) {
- srcdata->sync_to_device();
- }
-
- accl->gather(*srcdata, *dstdata, srctensor->numel(), opts.rootRank,
- ACCL::GLOBAL_COMM, true, true,
- get_compressed_type(srctensor->scalar_type()));
-
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
+ PRE_REQUEST(Gather, in_tensor);
- if (!coyote_enabled && rank_ == opts.rootRank) {
- dstdata->sync_from_device();
- }
+ auto req = accl->gather(*in_buf, *out_buf, in_tensor.numel(), opts.rootRank);
- // Copy results back to dsttensorvec
- if (rank_ == opts.rootRank) {
- for (const auto i : c10::irange(dsttensorvec.size())) {
- if (p2p_applicable(*accl, dsttensorvec[0], p2p_enabled)) {
- auto slice = dstdata->slice(i * srctensor->numel(),
- (i + 1) * srctensor->numel());
- copy_back_p2p_buffer(*slice, dsttensorvec[i]);
- } else {
- dsttensorvec[i].copy_(dsttensor.slice(0, i * srctensor->numel(),
- (i + 1) * srctensor->numel()));
- }
- }
- }
+ copy_back_tensorvec(dsttensorvec, out_buf, in_tensor, in_tensor.numel(), in_tensor.numel(), true, false, opts.rootRank);
}
c10::intrusive_ptr
@@ -1332,24 +957,59 @@ ProcessGroupACCL::gather(std::vector> &outputTensors,
      std::function<void(std::unique_ptr<WorkEntry> &)> runFunc =
          [opts, this](std::unique_ptr<WorkEntry> &entry) {
+ #ifdef GATHER_SIDESTEP
+ ACCL::debug("[Gather] -- Sidestepped using OpenMPI --");
+ auto data = (entry->src)[0];
+ void* recvbuf = nullptr;
+ at::Tensor flatOutputTensor;
+
+            std::vector<at::Tensor> dstdata = entry->dst;
+ if (rank_ == opts.rootRank) {
+ flatOutputTensor = newLikeFlat(dstdata);
+ recvbuf = flatOutputTensor.data_ptr();
+ }
+
+ c10::DeviceGuard guard(data.device());
+            std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
+ MPI_CHECK(MPI_Gather(
+ data.data_ptr(),
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ recvbuf,
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ opts.rootRank,
+ MPI_COMM_WORLD));
+
+ if (rank_ == opts.rootRank) {
+              const std::vector<at::Tensor>& outputDataVec = entry->dst;
+ // copy the flattened output tensors to the outputs
+ for (const auto i : c10::irange(outputDataVec.size())) {
+ outputDataVec.at(i).copy_(flatOutputTensor[i]);
+ }
+ }
+ #else
auto srctensor = (entry->src)[0];
auto &dsttensors = entry->dst;
// Segment data if necessary
if (srctensor.nbytes() > bufsize) {
- size_t n = bufsize / srctensor.itemsize();
- for (size_t i = 0; i < srctensor.numel(); i += n) {
+ size_t non_zero_dim_count = srctensor.numel() / srctensor.size(0);
+ size_t n = bufsize / srctensor.itemsize() / non_zero_dim_count;
+ ACCL::debug("[Gather] Segmenting tensor of size " + std::to_string(srctensor.nbytes()) + " into " + std::to_string(n * non_zero_dim_count) + "-sized elements ");
+ for (size_t i = 0; i < srctensor.size(0); i += n) {
size_t end =
- std::min(i + n, static_cast(srctensor.numel()));
+                std::min(n, static_cast<size_t>(srctensor.size(0)) - i);
            std::vector<at::Tensor> dsttensorslices;
dsttensorslices.reserve(dsttensors.size());
for (auto &dsttensor : dsttensors) {
- dsttensorslices.emplace_back(dsttensor.slice(0, i, end));
+ dsttensorslices.emplace_back(dsttensor.narrow(0, i, end));
}
- run_gather(srctensor.slice(0, i, end), dsttensorslices, opts);
+ run_gather(srctensor.narrow(0, i, end), dsttensorslices, opts);
}
} else {
run_gather(srctensor, dsttensors, opts);
}
+ #endif
};
if (rank_ == opts.rootRank) {
@@ -1365,118 +1025,23 @@ ProcessGroupACCL::gather(std::vector> &outputTensors,
}
}
-void ProcessGroupACCL::run_scatter(std::vector &srctensorvec,
- at::Tensor dsttensor_original,
+#undef COLL_NAME
+#define COLL_NAME Scatter
+void ProcessGroupACCL::run_scatter(std::vector<at::Tensor> &in_tensor_vec,
+ at::Tensor out_tensor,
const ScatterOptions &opts) {
- std::unique_ptr srcdata;
- at::Tensor *dsttensor = &dsttensor_original;
- at::Tensor empty_dsttensor;
- std::unique_ptr dstdata;
-
// Reserve device
- c10::DeviceGuard guard(dsttensor->device());
+ c10::DeviceGuard guard(out_tensor.device());
  std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
- // Create new input buffer, since srctensorvec is not continuous in memory
- if (rank_ == opts.rootRank) {
- at::Tensor srctensor;
- if (rank_ == opts.rootRank) {
- if (p2p_applicable(*accl, srctensorvec[0], p2p_enabled)) {
- srcdata = create_buffer_p2p(
- *accl, dsttensor->numel() * static_cast(size_),
- dsttensor->scalar_type());
- } else if (coyote_enabled) {
- srcdata = create_coyotebuffer(*accl,
- dsttensor->numel() * static_cast(size_),
- dsttensor->scalar_type());
- std::vector sizes = {static_cast(dsttensor->numel()) *
- size_};
- srctensor =
- torch::from_blob(srcdata->byte_array(), sizes,
- dsttensor->options().device(c10::DeviceType::CPU));
- } else {
- srcdata = create_buffer(*accl,
- dsttensor->numel() * static_cast(size_),
- dsttensor->scalar_type());
- std::vector sizes = {static_cast(dsttensor->numel()) *
- size_};
- srctensor =
- torch::from_blob(srcdata->byte_array(), sizes,
- dsttensor->options().device(c10::DeviceType::CPU));
- }
- }
-
- // Copy data to input buffer
- for (const auto i : c10::irange(srctensorvec.size())) {
- if (p2p_applicable(*accl, srctensorvec[0], p2p_enabled)) {
- auto slice = srcdata->slice(i * dsttensor->numel(),
- (i + 1) * dsttensor->numel());
- copy_to_p2p_buffer(*slice, srctensorvec[i]);
- } else {
- auto slice = srctensor.slice(0, i * dsttensor->numel(),
- (i + 1) * dsttensor->numel());
- slice.copy_(srctensorvec[i]);
- }
- }
- }
-
- // Create output buffer
- if (p2p_applicable(*accl, dsttensor_original, p2p_enabled)) {
- dstdata = create_and_copy_p2p_buffer(*accl, dsttensor_original);
- } else {
- if (coyote_enabled) {
- dstdata =
- create_coyotebuffer(*accl, dsttensor->numel(), dsttensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(dsttensor_original.numel()));
- empty_dsttensor = torch::from_blob(
- dstdata->byte_array(), dsttensor_original.sizes(),
- dsttensor_original.options().device(c10::DeviceType::CPU));
- dsttensor = &empty_dsttensor;
- dsttensor->copy_(dsttensor_original);
- } else if (dsttensor_original.device().type() != c10::DeviceType::CPU) {
- dstdata =
- create_buffer(*accl, dsttensor->numel(), dsttensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(dsttensor_original.numel()));
- empty_dsttensor = torch::from_blob(
- dstdata->byte_array(), dsttensor_original.sizes(),
- dsttensor_original.options().device(c10::DeviceType::CPU));
- dsttensor = &empty_dsttensor;
- dsttensor->copy_(dsttensor_original);
- } else {
- dstdata = create_buffer(*accl, *dsttensor);
- }
- }
-
- // Run scatter
- ACCL::debug("Starting scatter of " + std::to_string(dsttensor->numel()) +
- " items");
- if (!coyote_enabled && rank_ == opts.rootRank) {
- srcdata->sync_to_device();
- }
-
- accl->scatter(*srcdata, *dstdata, dsttensor->numel(), opts.rootRank,
- ACCL::GLOBAL_COMM, true, true,
- get_compressed_type(dsttensor->scalar_type()));
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
+ init_input_data_vec(in_tensor_vec, in_buf, out_tensor.options().device(c10::DeviceType::CPU), true, false, opts.rootRank);
+
+ PRE_REQUEST(Scatter, out_tensor);
+
+ auto req = accl->scatter(*in_buf, *out_buf, out_tensor.numel(), opts.rootRank);
- if (!coyote_enabled) {
- dstdata->sync_from_device();
- }
-
- // Copy result back to GPU if necessary
- if (p2p_applicable(*accl, dsttensor_original, p2p_enabled)) {
- copy_back_p2p_buffer(*dstdata, dsttensor_original);
- } else if (coyote_enabled || dsttensor_original.device().type() != c10::DeviceType::CPU) {
- ACCL::debug("Copying data back from CPU tensor of size " +
- std::to_string(dsttensor_original.numel()));
- dsttensor_original.copy_(*dsttensor);
- }
+ copy_back_tensor(out_tensor, out_buf, true, true, opts.rootRank);
}
c10::intrusive_ptr
@@ -1503,26 +1068,56 @@ ProcessGroupACCL::scatter(std::vector &outputTensors,
      std::function<void(std::unique_ptr<WorkEntry> &)> runFunc =
          [opts, this](std::unique_ptr<WorkEntry> &entry) {
+ #ifdef SCATTER_SIDESTEP
+ ACCL::debug("[Scatter] -- Sidestepped using OpenMPI --");
+ auto data = (entry->dst)[0];
+ void* sendbuf = nullptr;
+ at::Tensor flatInputTensor;
+
+ if (rank_ == opts.rootRank) {
+              std::vector<at::Tensor>& inputDataVec = entry->src;
+ flatInputTensor = newLikeFlat(inputDataVec);
+ sendbuf = flatInputTensor.data_ptr();
+
+              // copy the input tensors into the flattened send buffer
+ for (const auto i : c10::irange(inputDataVec.size())) {
+ flatInputTensor[i].copy_(inputDataVec.at(i));
+ }
+ }
+
+ c10::DeviceGuard guard(data.device());
+            std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
+ MPI_CHECK(MPI_Scatter(
+ sendbuf,
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ data.data_ptr(),
+ data.numel(),
+ mpiDatatype.at(data.scalar_type()),
+ opts.rootRank,
+ MPI_COMM_WORLD));
+ #else
auto &srctensors = entry->src;
auto dsttensor = (entry->dst)[0];
// Segment data if necessary
if (dsttensor.nbytes() > bufsize) {
- ACCL::debug("dsttensor to large!");
- size_t n = bufsize / dsttensor.itemsize();
- for (size_t i = 0; i < dsttensor.numel(); i += n) {
+ size_t non_zero_dim_count = dsttensor.numel() / dsttensor.size(0);
+ size_t n = bufsize / 4 / dsttensor.itemsize() / non_zero_dim_count;
+ for (size_t i = 0; i < dsttensor.size(0); i += n) {
ACCL::debug("part " + std::to_string(i) + "!");
size_t end =
- std::min(i + n, static_cast(dsttensor.numel()));
+                std::min(n, static_cast<size_t>(dsttensor.size(0)) - i);
            std::vector<at::Tensor> srctensorslices;
srctensorslices.reserve(srctensors.size());
for (auto &srctensor : srctensors) {
- srctensorslices.emplace_back(srctensor.slice(0, i, end));
+ srctensorslices.emplace_back(srctensor.narrow(0, i, end));
}
- run_scatter(srctensorslices, dsttensor.slice(0, i, end), opts);
+ run_scatter(srctensorslices, dsttensor.narrow(0, i, end), opts);
}
} else {
run_scatter(srctensors, dsttensor, opts);
}
+ #endif
};
if (rank_ == opts.rootRank) {
@@ -1549,89 +1144,57 @@ c10::intrusive_ptr ProcessGroupACCL::reduce_scatter(
TORCH_CHECK(false, "ProcessGroupACCL does not support reduce_scatter");
}
-void ProcessGroupACCL::run_alltoall(at::Tensor srctensor_original,
- at::Tensor dsttensor_original,
+#undef COLL_NAME
+#define COLL_NAME AlltoAll
+
+void ProcessGroupACCL::run_alltoall(at::Tensor in_tensor,
+ at::Tensor out_tensor,
const AllToAllOptions &opts) {
- at::Tensor *srctensor = &srctensor_original;
- at::Tensor *dsttensor = &dsttensor_original;
- at::Tensor empty_srctensor, empty_dsttensor;
- std::unique_ptr srcdata;
- std::unique_ptr dstdata;
+
+ init_input_tensor(in_tensor, in_buf, true, true);
// Reserve device
- c10::DeviceGuard guard(srctensor->device());
+ c10::DeviceGuard guard(in_tensor.device());
  std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
- // Copy data from GPU to FPGA if necessary, and create a new result buffer,
- // since ACCL doesn't support in-place allreduce
- if (p2p_applicable(*accl, srctensor_original, p2p_enabled)) {
- srcdata = create_and_copy_p2p_buffer(*accl, srctensor_original);
- } else {
- if (accl->is_simulated() || coyote_enabled) {
- srcdata = create_buffer(*accl, srctensor->numel(), srctensor->scalar_type());
- } else {
- srcdata = wrap_buffer(*accl, buf0, srctensor->numel(), srctensor->scalar_type());
- }
- ACCL::debug("Copying data to aligned CPU tensor of size " +
- std::to_string(srctensor_original.numel()));
- empty_srctensor = torch::from_blob(
- srcdata->byte_array(), srctensor_original.sizes(),
- srctensor_original.options().device(c10::DeviceType::CPU));
- srctensor = &empty_srctensor;
- srctensor->copy_(srctensor_original);
- ACCL::debug("Creating extra result buffer of size " +
- std::to_string(srctensor_original.numel()));
- }
+ // init_output_data(out_tensor, dstdata, out_tensor.numel(), out_tensor.scalar_type(), true, true);
- // Create output buffer
- if (p2p_applicable(*accl, dsttensor_original, p2p_enabled)) {
- dstdata = create_and_copy_p2p_buffer(*accl, dsttensor_original);
- } else {
- if (accl->is_simulated() || coyote_enabled) {
- dstdata = create_buffer(*accl, dsttensor->numel(), dsttensor->scalar_type());
- } else {
- dstdata = wrap_buffer(*accl, buf0, dsttensor->numel(), dsttensor->scalar_type());
- }
- }
+ PRE_REQUEST(AlltoAll, in_tensor);
- // Run alltoall
- ACCL::debug("Starting alltoall of " + std::to_string(srctensor->numel()) +
- " items");
- if (!coyote_enabled) {
- srcdata->sync_to_device();
- }
- accl->alltoall(*srcdata, *dstdata, srctensor->numel(),
- ACCL::GLOBAL_COMM, true, true,
- get_compressed_type(srctensor->scalar_type()));
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
- if (!coyote_enabled) {
- dstdata->sync_from_device();
- }
+ auto req = accl->alltoall(*in_buf, *out_buf, in_tensor.numel()/size_);
- // Copy result buffer back to original tensor
- if (p2p_applicable(*accl, dsttensor_original, p2p_enabled)) {
- copy_back_p2p_buffer(*dstdata, dsttensor_original);
- } else {
- ACCL::debug("Copying result data back to original tensor of size " +
- std::to_string(dsttensor_original.numel()));
- dsttensor_original.copy_(torch::from_blob(
- dstdata->byte_array(), dsttensor_original.sizes(),
- dsttensor_original.options().device(c10::DeviceType::CPU)));
- }
+ copy_back_tensor(out_tensor, out_buf, true, true);
+}
+
+
+void ProcessGroupACCL::run_alltoall_vec(std::vector<at::Tensor> &in_tensor_vec,
+                                        std::vector<at::Tensor> &out_tensor_vec,
+ const AllToAllOptions &opts) {
+ int a2a_nbytes = in_tensor_vec[0].nbytes();
+
+ c10::DeviceGuard guard(in_tensor_vec[0].device());
+  std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
+
+ init_input_data_vec(in_tensor_vec, in_buf, out_tensor_vec[0].options().device(c10::DeviceType::CPU), true, true);
+
+ PRE_REQUEST(AlltoAll, in_tensor_vec[0]);
+
+ auto req = accl->alltoall(*in_buf, *out_buf, in_tensor_vec[0].numel());
+
+ copy_back_tensorvec(out_tensor_vec, out_buf, in_tensor_vec[0], in_tensor_vec[0].numel(), in_tensor_vec[0].numel(), true, true);
+
}
c10::intrusive_ptr ProcessGroupACCL::alltoall_base(
at::Tensor &outputTensor, at::Tensor &inputTensor,
    std::vector<int64_t> &outputSplitSizes,
    std::vector<int64_t> &inputSplitSizes, const AllToAllOptions &opts) {
+ ACCL::debug("Starting AlltoAll");
if (outputSplitSizes.size() == 0 && inputSplitSizes.size() == 0) {
// We can use alltoall
TORCH_CHECK(
outputTensor.numel() == inputTensor.numel() &&
- outputTensor.type() == inputTensor.type(),
+ outputTensor.scalar_type() == inputTensor.scalar_type(),
"Tensors are not equal in size or data type");
TORCH_CHECK(
outputTensor.size(0) % size_ == 0,
@@ -1641,17 +1204,34 @@ c10::intrusive_ptr ProcessGroupACCL::alltoall_base(
        [opts, this](std::unique_ptr<WorkEntry>& entry) {
auto srctensor = (entry->src)[0];
auto dsttensor = (entry->dst)[0];
- c10::DeviceGuard guard(srctensor.device());
- std::unique_lock globalLock(pgGlobalMutex_);
+
+
+ // c10::DeviceGuard guard(srctensor.device());
+ // std::unique_lock globalLock(pgGlobalMutex_);
// Segment data if necessary
if (dsttensor.nbytes() > bufsize) {
- ACCL::debug("dsttensor to large!");
- size_t n = bufsize / dsttensor.itemsize();
- for (size_t i = 0; i < dsttensor.numel(); i += n) {
+
+ // Split individual entries
+ size_t non_zero_dim_count = dsttensor.numel() / dsttensor.size(0);
+ size_t n = bufsize / dsttensor.itemsize() / size_ / non_zero_dim_count;
+ size_t entry_size = dsttensor.numel() / size_ / non_zero_dim_count;
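+            // Each rank owns a contiguous block of entry_size rows in src/dst; the same
+            // window [i, i + end) is cut out of every block and handed to run_alltoall_vec
+            // as a vector of tensor slices.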
+ for (size_t i = 0; i < entry_size; i += n) {
ACCL::debug("part " + std::to_string(i) + "!");
- size_t end =
- std::min(i + n, static_cast(dsttensor.numel()));
- run_alltoall(srctensor.slice(0, i, end), dsttensor.slice(0, i, end), opts);
+            size_t end = std::min(n, static_cast<size_t>(entry_size) - i);
+
+            std::vector<at::Tensor> srctensorslices;
+ srctensorslices.reserve(size_);
+ for (int j = 0; j < size_; j++) {
+ int bufpos = j * entry_size;
+ srctensorslices.emplace_back(srctensor.narrow(0, i + bufpos, end));
+ }
+            std::vector<at::Tensor> dsttensorslices;
+ dsttensorslices.reserve(size_);
+ for (int j = 0; j < size_; j++) {
+ int bufpos = j * entry_size;
+ dsttensorslices.emplace_back(dsttensor.narrow(0, i + bufpos, end));
+ }
+ run_alltoall_vec(srctensorslices, dsttensorslices, opts);
}
} else {
run_alltoall(srctensor, dsttensor, opts);
@@ -1674,63 +1254,29 @@ c10::intrusive_ptr
ProcessGroupACCL::alltoall(std::vector<at::Tensor> &outputTensors,
                           std::vector<at::Tensor> &inputTensors,
const AllToAllOptions &opts) {
+ ACCL::debug("ProcessGroupACCL does not support alltoall");
TORCH_CHECK(false, "ProcessGroupACCL does not support alltoall");
}
-void ProcessGroupACCL::run_send(at::Tensor tensor_original, int dstRank,
+#undef COLL_NAME
+#define COLL_NAME Send
+
+void ProcessGroupACCL::run_send(at::Tensor in_tensor, int dstRank,
int tag) {
- at::Tensor *tensor = &tensor_original;
- at::Tensor empty_tensor;
- std::unique_ptr data;
// Reserve device
- c10::DeviceGuard guard(tensor->device());
+ c10::DeviceGuard guard(in_tensor.device());
  std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
- // Copy data from GPU to FPGA if necessary
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- data = create_and_copy_p2p_buffer(*accl, tensor_original);
- } else {
- if (coyote_enabled) {
- // Copy tensor to CPU tensor first
- data = create_coyotebuffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- tensor->copy_(tensor_original);
- } else if (tensor_original.device().type() != c10::DeviceType::CPU) {
- // Copy tensor to CPU tensor first
- data = create_buffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- tensor->copy_(tensor_original);
- } else {
- data = create_buffer(*accl, *tensor);
- }
- }
+ init_input_tensor(in_tensor, in_buf, true, true);
- // Run send
- ACCL::debug("Starting send of " + std::to_string(tensor->numel()) +
- " items to " + std::to_string(dstRank));
- if (!coyote_enabled) {
- data->sync_to_device();
- }
- accl->send(*data, tensor->numel(), dstRank, tag, ACCL::GLOBAL_COMM, true,
- get_compressed_type(tensor->scalar_type()));
+ PRE_REQUEST(Send,in_tensor);
+
+ ACCL::ACCLRequest* req = accl->send(*in_buf, in_tensor.numel(), dstRank, tag);
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
}
+
c10::intrusive_ptr
ProcessGroupACCL::send(std::vector<at::Tensor> &tensors, int dstRank, int tag) {
checkSingleTensor(tensors);
@@ -1742,8 +1288,8 @@ ProcessGroupACCL::send(std::vector &tensors, int dstRank, int tag) {
if (tensor.nbytes() > bufsize) {
size_t n = bufsize / tensor.itemsize();
for (size_t i = 0; i < tensor.numel(); i += n) {
- size_t end = std::min(i + n, static_cast(tensor.numel()));
- run_send(tensor.slice(0, i, end), dstRank, tag);
+      size_t end = std::min(n, static_cast<size_t>(tensor.numel()) - i);
+ run_send(tensor.narrow(0, i, end), dstRank, tag);
}
} else {
run_send(tensor, dstRank, tag);
@@ -1756,65 +1302,20 @@ ProcessGroupACCL::send(std::vector &tensors, int dstRank, int tag) {
      c10::optional<std::vector<at::Tensor>>(tensors));
}
-void ProcessGroupACCL::run_recv(at::Tensor tensor_original, int srcRank,
+#undef COLL_NAME
+#define COLL_NAME Recv
+
+void ProcessGroupACCL::run_recv(at::Tensor out_tensor, int srcRank,
int tag) {
- at::Tensor *tensor = &tensor_original;
- at::Tensor empty_tensor;
- std::unique_ptr data;
- // Reserve device
- c10::DeviceGuard guard(tensor->device());
+ c10::DeviceGuard guard(out_tensor.device());
  std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
- // Create FPGA buffer
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- data = create_buffer_p2p(*accl, tensor_original);
- } else {
- if (coyote_enabled) {
- // Copy tensor to CPU tensor first
- data = create_coyotebuffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Creating CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- } else if (tensor_original.device().type() != c10::DeviceType::CPU) {
- data = create_buffer(*accl, tensor->numel(), tensor->scalar_type());
- ACCL::debug("Copying data to CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- empty_tensor = torch::from_blob(
- data->byte_array(), tensor_original.sizes(),
- tensor_original.options().device(c10::DeviceType::CPU));
- tensor = &empty_tensor;
- } else {
- data = create_buffer(*accl, *tensor);
- }
- }
-
- // Run recieve
- ACCL::debug("Starting recieve of " + std::to_string(tensor->numel()) +
- " items from " + std::to_string(srcRank));
- accl->recv(*data, tensor->numel(), srcRank, tag, ACCL::GLOBAL_COMM, true,
- get_compressed_type(tensor->scalar_type()));
+ PRE_REQUEST(Receive, out_tensor);
+
+ ACCL::ACCLRequest* req = accl->recv(*out_buf, out_tensor.numel(), srcRank, tag);
- int retcode = accl->get_retcode();
- if (retcode) {
- TORCH_CHECK(false, ACCL_ERROR(retcode));
- }
-
- if (!coyote_enabled) {
- data->sync_from_device();
- }
-
- // Copy data back to original tensor if necessary
- if (p2p_applicable(*accl, tensor_original, p2p_enabled)) {
- copy_back_p2p_buffer(*data, tensor_original);
- } else if (coyote_enabled || tensor_original.device().type() != c10::DeviceType::CPU) {
- ACCL::debug("Copying data back from CPU tensor of size " +
- std::to_string(tensor_original.numel()));
- tensor_original.copy_(*tensor);
- }
+ copy_back_tensor(out_tensor, out_buf, true, true);
}
c10::intrusive_ptr
@@ -1828,8 +1329,8 @@ ProcessGroupACCL::recv(std::vector &tensors, int srcRank, int tag) {
if (tensor.nbytes() > bufsize) {
size_t n = bufsize / tensor.itemsize();
for (size_t i = 0; i < tensor.numel(); i += n) {
- size_t end = std::min(i + n, static_cast(tensor.numel()));
- run_recv(tensor.slice(0, i, end), srcRank, tag);
+      size_t end = std::min(n, static_cast<size_t>(tensor.numel()) - i);
+ run_recv(tensor.narrow(0, i, end), srcRank, tag);
}
} else {
run_recv(tensor, srcRank, tag);
@@ -1842,6 +1343,9 @@ ProcessGroupACCL::recv(std::vector &tensors, int srcRank, int tag) {
      c10::optional<std::vector<at::Tensor>>(tensors));
}
+#undef COLL_NAME
+#define COLL_NAME Unnamed
+
c10::intrusive_ptr
ProcessGroupACCL::recvAnysource(std::vector<at::Tensor> &tensors, int tag) {
TORCH_CHECK(false, "ProcessGroupACCL does not support recvAnysource");
@@ -1849,7 +1353,7 @@ ProcessGroupACCL::recvAnysource(std::vector &tensors, int tag) {
c10::intrusive_ptr
ProcessGroupACCL::barrier(const BarrierOptions &opts) {
- TORCH_CHECK(false, "ProcessGroupACCL does not support barrier");
+ accl->barrier();
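+  // Note: this function is declared to return a c10::intrusive_ptr, but no Work object is
+  // created or returned here; falling off the end of a non-void function is undefined behaviour.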
}
c10::intrusive_ptr
@@ -1890,9 +1394,6 @@ PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
"TCP ACCL backend; uses EasyNet network kernel on hardware")
.value("udp", accl_network_utils::acclDesign::UDP,
"UDP ACCL backend; uses VNx network kernel on hardware")
- .value("roce", accl_network_utils::acclDesign::ROCE,
- "Only applicable for hardware; uses UDP ACCL backend and RoCE "
- "network kernel")
.value("cyt_tcp", accl_network_utils::acclDesign::CYT_TCP,
"Only applicable for hardware; uses coyote ACCL backend with a "
"TCP network kernel")
diff --git a/integrations/pytorch_ddp/src/coyote_init.cpp b/integrations/pytorch_ddp/src/coyote_init.cpp
deleted file mode 100644
index a6be2cc4..00000000
--- a/integrations/pytorch_ddp/src/coyote_init.cpp
+++ /dev/null
@@ -1,103 +0,0 @@
-/*****************************************************************************
- Copyright (C) 2023 Advanced Micro Devices, Inc
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-
-*****************************************************************************/
-
-#include "coyote_init.hpp"
-#include
-#include
-
-namespace {
-inline void swap_endianness(uint32_t *ip) {
- uint8_t *ip_bytes = reinterpret_cast(ip);
- *ip = (ip_bytes[3] << 0) | (ip_bytes[2] << 8) | (ip_bytes[1] << 16) |
- (ip_bytes[0] << 24);
-}
-
-uint32_t _ip_encode(std::string ip) {
- struct sockaddr_in sa;
- inet_pton(AF_INET, ip.c_str(), &(sa.sin_addr));
- swap_endianness(&sa.sin_addr.s_addr);
- return sa.sin_addr.s_addr;
-}
-
-std::string ip_decode(uint32_t ip) {
- char buffer[INET_ADDRSTRLEN];
- struct in_addr sa;
- sa.s_addr = ip;
- swap_endianness(&sa.s_addr);
- inet_ntop(AF_INET, &sa, buffer, INET_ADDRSTRLEN);
- return std::string(buffer, INET_ADDRSTRLEN);
-}
-
-void exchange_qp(unsigned int first_rank, unsigned int second_rank,
- unsigned int local_rank,
- std::vector &ibvQpConn_vec,
- std::vector &ranks) {
- // write established connection to hardware and perform arp lookup
- if (local_rank == first_rank) {
- int connection =
- (ibvQpConn_vec[second_rank]->getQpairStruct()->local.qpn & 0xFFFF) |
- ((ibvQpConn_vec[second_rank]->getQpairStruct()->remote.qpn & 0xFFFF)
- << 16);
- ibvQpConn_vec[second_rank]->setConnection(connection);
- ibvQpConn_vec[second_rank]->writeContext(ranks[second_rank].port);
- ibvQpConn_vec[second_rank]->doArpLookup();
- ranks[second_rank].session_id =
- ibvQpConn_vec[second_rank]->getQpairStruct()->local.qpn;
- } else if (local_rank == second_rank) {
- int connection =
- (ibvQpConn_vec[first_rank]->getQpairStruct()->local.qpn & 0xFFFF) |
- ((ibvQpConn_vec[first_rank]->getQpairStruct()->remote.qpn & 0xFFFF)
- << 16);
- ibvQpConn_vec[first_rank]->setConnection(connection);
- ibvQpConn_vec[first_rank]->writeContext(ranks[first_rank].port);
- ibvQpConn_vec[first_rank]->doArpLookup();
- ranks[first_rank].session_id =
- ibvQpConn_vec[first_rank]->getQpairStruct()->local.qpn;
- }
-}
-
-} // namespace
-
-namespace coyote_init {
-void setup_cyt_rdma(std::vector &ibvQpConn_vec,
- std::vector &ranks, int local_rank,
- ACCL::CoyoteDevice &device) {
- std::cout << "[ACCL Coyote] Initializing QP..." << std::endl;
- // create single page dummy memory space for each qp
- uint32_t n_pages = 1;
- for (int i = 0; i < ranks.size(); i++) {
- fpga::ibvQpConn *qpConn = new fpga::ibvQpConn(
- device.coyote_qProc_vec[i], ranks[local_rank].ip, n_pages);
- ibvQpConn_vec.push_back(qpConn);
- }
-}
-
-void configure_cyt_rdma(std::vector &ibvQpConn_vec,
- std::vector &ranks, int local_rank) {
- std::cout << "[ACCL Coyote] Exchanging QP..." << std::endl;
- for (int first_rank = 0; first_rank < ranks.size(); first_rank++) {
- for (int second_rank = first_rank + 1; second_rank < ranks.size();
- second_rank++) {
- exchange_qp(first_rank, second_rank, local_rank, ibvQpConn_vec, ranks);
- this_thread::sleep_for(500ms);
- }
- }
-
- this_thread::sleep_for(3s);
- std::cout << "[ACCL Coyote] Finished exchanging QP!" << std::endl;
-}
-} // namespace coyote_init
diff --git a/integrations/pytorch_ddp/test/test-compression.py b/integrations/pytorch_ddp/test/old_testscripts/test-compression.py
similarity index 100%
rename from integrations/pytorch_ddp/test/test-compression.py
rename to integrations/pytorch_ddp/test/old_testscripts/test-compression.py
diff --git a/integrations/pytorch_ddp/test/test-coyote.py b/integrations/pytorch_ddp/test/old_testscripts/test-coyote.py
similarity index 98%
rename from integrations/pytorch_ddp/test/test-coyote.py
rename to integrations/pytorch_ddp/test/old_testscripts/test-coyote.py
index 6cf1927e..c3b63422 100644
--- a/integrations/pytorch_ddp/test/test-coyote.py
+++ b/integrations/pytorch_ddp/test/old_testscripts/test-coyote.py
@@ -161,10 +161,10 @@ def start_test(simulator: bool):
for i in range(size)]
if simulator:
- accl.create_simulate_process_group(ranks, bufsize=rxbufsize)
+ accl.create_simulate_process_group(ranks, bufsize=rxbufsize, initialize=True)
else:
accl.create_process_group_coyote(ranks, accl.ACCLDesign.cyt_rdma,
- bufsize=rxbufsize)
+ bufsize=rxbufsize, initialize=True)
dist.init_process_group("ACCL", rank=rank, world_size=size)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
diff --git a/integrations/pytorch_ddp/test/test-gpu-p2p.py b/integrations/pytorch_ddp/test/old_testscripts/test-gpu-p2p.py
similarity index 100%
rename from integrations/pytorch_ddp/test/test-gpu-p2p.py
rename to integrations/pytorch_ddp/test/old_testscripts/test-gpu-p2p.py
diff --git a/integrations/pytorch_ddp/test/test-gpu.py b/integrations/pytorch_ddp/test/old_testscripts/test-gpu.py
similarity index 100%
rename from integrations/pytorch_ddp/test/test-gpu.py
rename to integrations/pytorch_ddp/test/old_testscripts/test-gpu.py
diff --git a/integrations/pytorch_ddp/test/test-segmentation.py b/integrations/pytorch_ddp/test/old_testscripts/test-segmentation.py
similarity index 100%
rename from integrations/pytorch_ddp/test/test-segmentation.py
rename to integrations/pytorch_ddp/test/old_testscripts/test-segmentation.py
diff --git a/integrations/pytorch_ddp/test/test-udp.py b/integrations/pytorch_ddp/test/old_testscripts/test-udp.py
similarity index 100%
rename from integrations/pytorch_ddp/test/test-udp.py
rename to integrations/pytorch_ddp/test/old_testscripts/test-udp.py
diff --git a/integrations/pytorch_ddp/test/test.py b/integrations/pytorch_ddp/test/old_testscripts/test.py
similarity index 100%
rename from integrations/pytorch_ddp/test/test.py
rename to integrations/pytorch_ddp/test/old_testscripts/test.py
diff --git a/integrations/pytorch_ddp/test/plot_composition.py b/integrations/pytorch_ddp/test/plot_composition.py
new file mode 100644
index 00000000..fca2dd95
--- /dev/null
+++ b/integrations/pytorch_ddp/test/plot_composition.py
@@ -0,0 +1,80 @@
+import re
+import matplotlib.pyplot as plt
+import seaborn as sns
+import numpy as np
+
+
+# keywords = ["Broadcast", "Allreduce", "AlltoAll", ]
+# keywords = ["Broadcast", "Allreduce" ]
+keywords = ["Allreduce", "AlltoAll", ]
+# parts = ["lib","barrier","total"]
+parts = ["lib","barrier","copy","lock","init", "type", "total"]
+
+part_pattern = re.compile(r"(.*)_tensor durationUs:.*")
+
+measurement_pattern = re.compile(r".*_tensor durationUs:(.*)")
+
+# keyword_pattern = re.compile(r"Starting (Broadcast|Allreduce|AlltoAll)")
+keyword_pattern = re.compile(r"Starting (Allreduce|AlltoAll)")
+# keyword_pattern = re.compile(r"Starting (Broadcast|Allreduce)")
+
+log_file_path = './accl_log/rank_0_stderr'
+
+with open(log_file_path, 'r') as log_file:
+ lines = log_file.readlines()
+
+current_keyword = None
+# results = { "Broadcast": {}, "Allreduce": {}, "AlltoAll": {} }
+# results = { "Broadcast": {}, "Allreduce": {}}
+results = { "Allreduce": {}, "AlltoAll": {} }
+
+# averages = { "Broadcast": {}, "Allreduce": {}}
+# averages = { "Broadcast": {}, "Allreduce": {}, "AlltoAll": {} }
+averages = { "Allreduce": {}, "AlltoAll": {} }
+
+for op in results:
+ for part in parts:
+ results[op][part] = []
+
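+# Parsing is a small state machine: a "Starting <op>" line selects the current op, and every
+# following "<part>_tensor durationUs:" line is credited to that op until "total" closes the record.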
+for line in lines:
+ keyword_match = keyword_pattern.search(line)
+ if keyword_match:
+ current_keyword = keyword_match.group(1)
+ continue
+
+ if current_keyword:
+ part_match = part_pattern.search(line)
+ if not part_match:
+ continue
+ part = part_match.group(1).strip()
+ if part in parts:
+ measurement_match = measurement_pattern.search(line)
+ measurement = measurement_match.group(1).strip()
+ results[current_keyword][part].append(float(measurement))
+ if part=='total':
+ current_keyword = None
+
+for op, parts in results.items():
+ for part, values in parts.items():
+ test_sum = 0
+ count = 0
+ for el in values:
+ test_sum += el
+ count += 1
+ averages[op][part] = test_sum / count
+
+for op, part in averages.items():
+ labels = [key for key in part if key != 'total' and key != 'barrier']
+ times = [part[key] for key in labels]
+ total_time = part['total'] - part['barrier']
+ other_time = total_time - sum(times)
+
+ if other_time > 0:
+ labels.append('Other')
+ times.append(other_time)
+
+ plt.figure()
+ plt.pie(times, labels=labels, autopct=lambda p: f'{p * total_time / 100:.2f}us')
+ plt.title(f'Runtime Distribution for {op}')
+
+ plt.savefig('composition_' + op + '_plot.png')
diff --git a/integrations/pytorch_ddp/test/plot_small.py b/integrations/pytorch_ddp/test/plot_small.py
new file mode 100644
index 00000000..f1575671
--- /dev/null
+++ b/integrations/pytorch_ddp/test/plot_small.py
@@ -0,0 +1,130 @@
+import re
+import matplotlib.pyplot as plt
+import seaborn as sns
+import numpy as np
+import matplotlib.ticker as mticker
+
+keywords = ["Broadcast", "Allreduce", "AlltoAll" ]
+# parts = ["lib","barrier","total"]
+parts = ["lib", "copy","init", "total", "device", "pytorch", "sleep"]
+parts_plot = ["init", "device", "lib_oh", "copy", "total_oh", "pytorch_oh"]
+
+
+part_pattern = re.compile(r"(.*)_.*_.* durationUs: .*")
+op_pattern = re.compile(r".*_(.*)_.* durationUs: .*")
+count_pattern = re.compile(r".*_.*_(.*) durationUs: .*")
+
+measurement_pattern = re.compile(r".*_.*_.* durationUs: (.*)")
+
+
+log_file_path = './accl_log/rank_0_stderr'
+
+with open(log_file_path, 'r') as log_file:
+ lines = log_file.readlines()
+
+current_keyword = None
+results = { "Broadcast": {}, "Allreduce": {}, "AlltoAll": {}}
+averages = { "Broadcast": {}, "Allreduce": {}, "AlltoAll": {}}
+
+# results = { "Broadcast": {}, "Allreduce": {}}
+
+sizes = []
+
+for op in results:
+ for part in parts:
+ results[op][part] = {}
+ averages[op][part] = {}
+ for part in parts_plot:
+ averages[op][part] = {}
+for line in lines:
+ part_match = part_pattern.search(line)
+ op_match = op_pattern.search(line)
+ count_match = count_pattern.search(line)
+ if (not part_match) or (not op_match) or (not count_match):
+ continue
+ part = part_match.group(1).strip()
+ op = op_match.group(1).strip()
+ cnt = int(count_match.group(1).strip())
+ if cnt > 2097152:
+ continue
+ if op not in keywords:
+ continue
+
+ if cnt not in sizes:
+ sizes.append(cnt)
+ if part in parts:
+ measurement_match = measurement_pattern.search(line)
+ measurement = measurement_match.group(1).strip()
+ if cnt not in results[op][part].keys():
+ results[op][part][cnt] = []
+ results[op][part][cnt].append(float(measurement))
+
+
+for op, parts in results.items():
+ for part, cnts in parts.items():
+ for cnt, mes in cnts.items():
+ test_sum = 0
+ count = 0
+ for el in mes:
+ test_sum += el
+ count += 1
+ averages[op][part][cnt] = test_sum / count
+
+sizes.sort()
+
+for op, parts in averages.items():
+ for cnt in sizes:
+ if cnt == 32:
+ print(averages[op]['lib_oh'])
+ print(parts['lib'])
+ print(parts['device'])
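+        # Derived metrics: lib_oh = lib - device, total_oh = total - (sleep + lib + init + copy),
+        # and pytorch_oh = pytorch - total (presumably the overhead added above the process group).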
+ averages[op]['lib_oh'][cnt] = parts['lib'][cnt] - parts['device'][cnt]
+ averages[op]['total_oh'][cnt] = parts['total'][cnt] - parts['sleep'][cnt] - parts['lib'][cnt] - parts['init'][cnt] - parts['copy'][cnt]
+ averages[op]['pytorch_oh'][cnt] = parts['pytorch'][cnt] - (parts['total'][cnt])
+
+ averages[op].pop('lib')
+ averages[op].pop('total')
+ averages[op].pop('sleep')
+ averages[op].pop('pytorch')
+
+
+av_lists = {}
+for word in keywords:
+ av_lists[word] = {}
+ for part in parts_plot:
+ av_lists[word][part] = []
+ for size in sizes:
+ # if size == 32:
+ # continue
+ av_lists[word][part].append(averages[word][part][size])
+
+
+
+# print(av_lists['Allreduce'])
+# print(av_lists['Allreduce'].values().shape)
+# print(sizes)
+# print(sizes.shape)
+
+for op in keywords:
+ fig, ax = plt.subplots()
+ ax.stackplot(sizes, av_lists[op].values(),
+ labels=av_lists[op].keys(), alpha=0.8)
+ ax.legend(loc='upper left', reverse=True)
+ plt.gca().set_xscale('log', base=2)
+ ax.set_title(op + ' Execution time composition')
+ ax.set_xlabel('size[B]')
+ ax.set_ylabel('Latency us')
+    # optional: add a minor y-axis tick every 0.2 (see the commented-out locator below)
+ # ax.yaxis.set_minor_locator(mticker.MultipleLocator(.2))
+
+ plt.savefig(op + '_composition.png')
+
+# for i, (dict_name, sub_dict) in enumerate(results.items()):
+ # for j, (key, values) in enumerate(sub_dict.items()):
+ # sns.histplot(values, ax=axes[i, j], bins=20, stat='percent', kde=True)
+ # axes[i, j].set_title(f'{dict_name} - {key}')
+
+# plt.tight_layout()
+# plt.savefig("fullplot.png")
+
+
diff --git a/integrations/pytorch_ddp/test/run.sh b/integrations/pytorch_ddp/test/run.sh
new file mode 100755
index 00000000..6482f9d1
--- /dev/null
+++ b/integrations/pytorch_ddp/test/run.sh
@@ -0,0 +1,140 @@
+#!/bin/bash
+
+#check working directory
+if [[ $(pwd) != *pytorch_ddp/test ]]; then
+ echo "ERROR: this script should only be run in the pytorch_ddp/test dir of the repo!"
+ exit 1
+fi
+
+if [[ -v ACCL_SCRIPT ]]; then
+ SCRIPT_NAME="$ACCL_SCRIPT"
+else
+ # SCRIPT_NAME="test-mnist.py -d True -n 2" # MNIST
+    # SCRIPT_NAME="test-resnet50.py -d True -n 2" # ResNet-50
+ SCRIPT_NAME=test-generic.py
+ # SCRIPT_NAME="test-imagenet.py -d True"
+ echo "Variable ACCL_SCRIPT not set. Assuming $SCRIPT_NAME"
+fi
+
+# state variables
+mkdir -p "$(pwd)/accl_log"
+# BUILD_DIR=../build
+# point this to python venv, which has the relevant libraries installed
+VENV_ACTIVATE=$(pwd)/../venv/bin/activate
+SETUP_SH=$(pwd)/../setup.sh
+SCRIPT=$(pwd)/$SCRIPT_NAME
+HOST_FILE=./accl_log/host
+FPGA_FILE=./accl_log/fpga
+
+#enter venv and run script
+EXEC="bash -c \"source $VENV_ACTIVATE && source $SETUP_SH && python $SCRIPT"
+# EXEC="python $SCRIPT"
+
+
+#---------------Setting up vars-------------
+if [[ $ACCL_SIM -eq 1 ]]; then
+ echo "Starting in simulator mode. Make sure to start the emulator beforehand"
+ # ARG="-s -d True" #MNIST
+ ARG="-s "
+
+ ACCL_COMMS="udp"
+
+ echo "assuming $ACCL_COMMS comms in simulator"
+
+ if [[ -v ACCL_NP ]]; then
+ NUM_PROCESS="$ACCL_NP"
+ else
+ echo "Variable ACCL_NP not set. Enter num of processes:"
+ read -a NUM_PROCESS
+ fi
+
+ MASTER_IP="localhost"
+ MASTER_PORT="30505"
+
+else
+ echo "Starting in hw mode. Make sure to run flow_u55c beforehand."
+ if [[ -v U55C_IDS ]]; then
+ IFS=' ' read -r -a SERVID <<< "$U55C_IDS"
+ else
+ # read server ids from user
+ echo "Variable U55C_IDS not set. Enter u55c machine ids (space separated):"
+ read -a SERVID
+ fi
+
+ if ! [[ -v ACCL_COMMS ]]; then
+ ACCL_COMMS="cyt_rdma"
+ echo "Assuming cyt_rdma comms in hardware"
+ fi
+
+ RANK_PORT="30501"
+ # create ip files
+ rm -f $HOST_FILE $FPGA_FILE
+ NUM_PROCESS=0
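+    # Host NIC addresses are derived as 10.253.74.((ID-1)*4 + 66) and FPGA addresses as
+    # 10.253.74.((ID-1)*4 + 68), following the HACC u55c addressing this script assumes.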
+ for ID in ${SERVID[@]}; do
+ echo "10.253.74.$(((ID-1) * 4 + 66))">>$HOST_FILE
+ echo "10.253.74.$(((ID-1) * 4 + 68))">>$FPGA_FILE
+ NUM_PROCESS=$((NUM_PROCESS+1))
+ HOST_LIST+="alveo-u55c-$(printf "%02d" $ID) "
+ HOST_PORT_LIST+="alveo-u55c-$(printf "%02d" $ID):$RANK_PORT "
+ done
+
+ echo "HOST_LIST: ${HOST_LIST[*]}"
+
+ #set master address
+ MASTER_IP="10.253.74.$(((${SERVID[0]}-1) * 4 + 66))"
+ MASTER_PORT="30505"
+
+ echo "Master node set to: $MASTER_IP:$MASTER_PORT"
+
+ # 09 and 10 have other interface names:
+ MPI_ARGS="-f $HOST_FILE --iface ens4f0"
+ # MPI_ARGS="-f $HOST_FILE --iface ens4"
+fi
+
+ARG="$ARG -c $ACCL_COMMS -i $HOST_FILE -f $FPGA_FILE -a $MASTER_IP -p $MASTER_PORT\""
+
+#---------------Running it-------------
+
+echo "Run command: $EXEC $ARG"
+
+echo "Running with $NUM_PROCESS Processes"
+
+rm -f $(pwd)/accl_log/rank*
+rm -f $(pwd)/accl_log/accl_pg_*
+rm -rf $(pwd)/accl_log/profiler_log
+
+# C="mpirun -n $NUM_PROCESS $MPI_ARGS -outfile-pattern \"$(pwd)/accl_log/rank_%r_stdout\" $EXEC $ARG &"
+C="mpirun -n $NUM_PROCESS $MPI_ARGS -outfile-pattern \"$(pwd)/accl_log/rank_%r_stdout\" -errfile-pattern \"$(pwd)/accl_log/rank_%r_stderr\" $EXEC $ARG &"
+# C="mpirun -n $NUM_PROCESS $MPI_ARGS $EXEC $ARG &"
+echo $C
+
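+# NOTE: the script exits here, so the mpirun command above is only printed, never executed;
+# the launch, sleep and cleanup logic below is currently dead code.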
+exit 0
+
+/bin/sh -c "$C"
+
+if ! [[ -v SLEEPTIME ]]; then
+ SLEEPTIME="32"
+fi
+echo "Sleeping for $SLEEPTIME"
+sleep $SLEEPTIME
+
+# if ! [[ $ACCL_SIM -eq 1 ]]; then
+ # parallel-ssh -H "$HOST_LIST" "killall -9 $SCRIPT_NAME"
+ # parallel-ssh -H "$HOST_LIST" "dmesg | grep "fpga_tlb_miss_isr" >$(pwd)/accl_log/tlb_miss.log"
+# else
+ # killall -9 $SCRIPT_NAME
+ # dmesg | grep "fpga_tlb_miss_isr" >$(pwd)/accl_log/tlb_miss.log
+# fi
+
+# mkdir -p "$(pwd)/accl_results"
+# # Loop through accl log files in the source directory and append to accl_results folder
+# for source_log in "$(pwd)/accl"*.log; do
+# # Extract the log number from the source log file name (assuming the format is acclX.log)
+# log_number=$(basename "${source_log}" | sed 's/accl\([0-9]*\)\.log/\1/')
+# # Create the destination log file path
+# destination_log="$(pwd)/accl_results/accl${log_number}.log"
+# # Append the content of the source log to the destination log
+# cat "${source_log}" >> "${destination_log}"
+# # Remove the tmp log
+# rm ${source_log}
+# done
diff --git a/integrations/pytorch_ddp/test/test-generic.py b/integrations/pytorch_ddp/test/test-generic.py
new file mode 100644
index 00000000..ad0af45f
--- /dev/null
+++ b/integrations/pytorch_ddp/test/test-generic.py
@@ -0,0 +1,619 @@
+# /*****************************************************************************
+# Copyright (C) 2023 Advanced Micro Devices, Inc
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# *****************************************************************************/
+
+from __future__ import annotations
+from typing import Optional
+import numpy as np
+import os
+import sys
+import logging
+import time
+from mpi4py.MPI import COMM_WORLD as mpi
+
+import torch
+import torch.distributed as dist
+from torch.profiler import profile, ProfilerActivity
+import accl_process_group as accl
+
+from torch.nn.parallel import DistributedDataParallel as DDP
+import torch.nn as nn
+import torch.optim as optim
+
+from torch.utils.data import Dataset, DataLoader
+from torch.utils.data.distributed import DistributedSampler
+
+#Configure logging
+logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
+
+logger = logging.getLogger(__name__)
+
+if "ACCL_DEBUG" in os.environ and os.environ["ACCL_DEBUG"]=="1":
+ logger.setLevel(logging.DEBUG)
+else:
+ logger.setLevel(logging.WARNING)
+
+rank = 0
+size = 0
+
+x = 1024
+y = 1
+
+seed = 48
+torch.manual_seed(seed)
+
+count = x * y
+num_el = x * y
+shape = (x , y)
+#As in test.cpp defaults
+rxbufsize = 4096 * 1024
+
+
+
+def test_broadcast(numel, testtype):
+ shape = (numel,)
+
+ # testtype = torch.float32
+ global num_errors
+
+ if testtype == torch.int64 or testtype == torch.int32:
+ rand_torch = torch.randint(torch.iinfo(testtype).min, torch.iinfo(testtype).max,shape, dtype=testtype)
+ # rand_torch = torch.ones(shape, dtype=testtype)
+ else:
+ rand_torch = torch.rand(shape, dtype=testtype)
+
+ # for i in range(10):
+ if True:
+
+ if rank == 0:
+ x = rand_torch.clone()
+ else:
+ x = torch.ones(shape, dtype=testtype)
+
+ mpi.Barrier()
+
+ with torch.profiler.record_function("test bcast "):
+
+ start_time = time.perf_counter()
+
+ dist.broadcast(x, 0)
+
+ end_time = time.perf_counter()
+
+ measured_time = (end_time - start_time) * 1000000
+
+ print(str(rank) + "_pytorch_Broadcast_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ mpi.Barrier()
+
+ end_time = time.perf_counter()
+
+ measured_time = (end_time - start_time) * 1000000
+
+
+ try:
+ np.testing.assert_allclose(x, rand_torch)
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test Broadcast failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test broadcast finished!")
+
+def test_sendrcv(numel):
+ global num_errors
+
+ shape = (numel,)
+ x = torch.full(shape, float(rank))
+
+ y = torch.empty(shape)
+
+ prev_rank = (rank - 1) % size
+ next_rank = (rank + 1) % size
+
+
+ with torch.profiler.record_function("test_sendrcv"):
+ if rank % 2:
+ mpi.Barrier()
+ start_time = time.perf_counter()
+ dist.send(x, next_rank)
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Send_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ mpi.Barrier()
+ start_time = time.perf_counter()
+ dist.recv(y, prev_rank)
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Recv_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+ else:
+ mpi.Barrier()
+ start_time = time.perf_counter()
+ dist.recv(y, prev_rank)
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Recv_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ mpi.Barrier()
+ start_time = time.perf_counter()
+ dist.send(x, next_rank)
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Send_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+ mpi.Barrier()
+ try:
+ np.testing.assert_allclose(y, torch.full(shape, prev_rank))
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test Sendrcv failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test Sendrcv finished!")
+
+
+def test_scatter(numel):
+ global num_errors
+
+ shape = (numel,)
+ if rank == 0:
+ x = [torch.full(shape, float(i+1)) for i in range(size)]
+ else:
+ x = None
+ y = torch.full(shape, float(0))
+
+ mpi.Barrier()
+ start_time = time.perf_counter()
+
+ with torch.profiler.record_function("test_scatter"):
+
+ dist.scatter(y, x, 0)
+
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Scatter_" + str(y.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ try:
+ np.testing.assert_allclose(y, torch.full(shape, float(rank+1)))
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test Scatter failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test Scatter finished!")
+
+
+
+def test_gather(numel):
+ global num_errors
+
+ shape = (numel,)
+ x = torch.full(shape, float(rank))
+
+ if rank == 0:
+ y = [torch.empty(shape) for _ in range(size)]
+ else:
+ y = None
+
+ mpi.Barrier()
+ start_time = time.perf_counter()
+
+ with torch.profiler.record_function("test_gather"):
+
+ dist.gather(x, y, 0)
+
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Gather_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ if rank == 0:
+ for i, c in enumerate(y):
+ try:
+ np.testing.assert_allclose(c, torch.full(shape, float(i)))
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test Gather failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test Gather finished!")
+
+
+def test_allgather(numel, testtype):
+ global num_errors
+
+ shape = (numel,)
+ if testtype == torch.int64 or testtype == torch.int32:
+ rand_torch = torch.randint(torch.iinfo(testtype).min, torch.iinfo(testtype).max,shape, dtype=testtype)
+ else:
+ rand_torch = torch.rand(shape, dtype=testtype)
+ x = rand_torch.clone()
+ y = [torch.full(shape, 0, dtype=testtype) for _ in range(size)]
+
+ mpi.Barrier()
+ start_time = time.perf_counter()
+
+ print('len y:' + str(len(y)))
+
+ with torch.profiler.record_function("test_allgather"):
+ dist.all_gather(y, x)
+
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Allgather_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ mpi.Barrier()
+
+
+ for i, c in enumerate(y):
+ try:
+ np.testing.assert_allclose(c, rand_torch)
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test AllGather failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test AllGather finished!")
+
+
+
+def test_reduce(numel):
+ global num_errors
+
+
+ shape = (numel,)
+ x = torch.ones(shape)
+
+ mpi.Barrier()
+ start_time = time.perf_counter()
+ with torch.profiler.record_function("test_reduce"):
+
+ dist.reduce(x, 0, dist.ReduceOp.SUM)
+ mpi.Barrier()
+
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Reduce_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ if rank == 0:
+ try:
+ np.testing.assert_allclose(x, torch.full(shape, float(size)))
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test Reduce failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test Reduce finished!")
+
+
+def test_allreduce(numel, testtype):
+
+ global num_errors
+
+ shape = (numel,)
+
+
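+    # Bound the random ints by iinfo//size so that summing across all ranks cannot overflow the dtype.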
+ if testtype == torch.int64 or testtype == torch.int32:
+ rand_torch = torch.randint(torch.iinfo(testtype).min//size, torch.iinfo(testtype).max//size,shape, dtype=testtype)
+ else:
+ rand_torch = torch.rand(shape, dtype=testtype)
+
+ # for i in range(10):
+ if True:
+
+ # shape = (320001,)
+ x = rand_torch.clone()
+
+ mpi.Barrier()
+
+ start_time = time.perf_counter()
+
+
+ with torch.profiler.record_function("test_allreduce"):
+
+ dist.all_reduce(x, dist.ReduceOp.SUM)
+
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print(str(rank) + "_pytorch_Allreduce_" + str(x.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ logger.debug("Directly measured time us 1:" + str(measured_time))
+
+ mpi.Barrier()
+
+ try:
+ np.testing.assert_allclose(x, rand_torch * size)
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test AllReduce failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test AllReduce finished!")
+
+
+def test_alltoall(numel):
+ global num_errors
+
+ # num_el = 26624
+
+ shape = (numel,)
+
+ input = torch.arange(numel, dtype=torch.float) + float(rank) * numel
+
+ input_shaped = input.reshape(shape)
+
+ output = torch.ones(numel)
+
+ output_shaped = output.reshape(shape)
+
+ start_time = time.perf_counter()
+
+ with torch.profiler.record_function("test_alltoall"):
+
+ dist.all_to_all_single(output_shaped, input_shaped)
+
+ end_time = time.perf_counter()
+
+ measured_time = (end_time - start_time) * 1000000
+
+ print(str(rank) + "_pytorch_AlltoAll_" + str(input.nbytes) + " durationUs: " + str(measured_time), file=sys.stderr)
+
+ test = torch.zeros(numel)
+
+ section_size = int(numel/size)
+
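+    # Expected result of all_to_all_single: output section i holds chunk `rank` of rank i's input,
+    # i.e. the values i * numel + rank * section_size + el for el in [0, section_size).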
+ for section in range(size):
+ for el in range(section_size):
+ test[section * section_size + el] = float(rank) * section_size + section * numel + el
+
+ test_shaped = test.reshape(shape)
+ try:
+ np.testing.assert_allclose(output_shaped, test_shaped)
+ except AssertionError as e:
+ num_errors = num_errors + 1
+ logger.debug("Test AlltoAll failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test AlltoAll finished!")
+
+class ToyModel(nn.Module):
+ def __init__(self):
+ super(ToyModel, self).__init__()
+ self.net1 = nn.Linear(10, 10)
+ self.relu = nn.ReLU()
+ self.net2 = nn.Linear(10, 5)
+
+ def forward(self, x):
+ return self.net2(self.relu(self.net1(x)))
+
+class MyTrainDataset(Dataset):
+ def __init__(self, size):
+ self.size = size
+
+ self.data = []
+ for i in range(size):
+ in_feature = torch.zeros(10)
+ out_feature = torch.zeros(5)
+ for j in range(10):
+ in_feature[j] = float((i^2 + j) % 5)
+ # try to learn a linear function of the input, to make sure it's parameterizable
+                out_feature[j//2] = out_feature[j//2] + float(((i^2 + j) % 5) * 3 * ((-1) ** (j % 2)))
+ self.data.append((in_feature, out_feature))
+
+
+
+
+ def __len__(self):
+ return self.size
+
+ def __getitem__(self, index):
+ return self.data[index]
+
+
+def prepare_dataloader(dataset: Dataset, batch_size: int):
+ return DataLoader(
+ dataset,
+ batch_size=batch_size,
+ pin_memory=True,
+ shuffle=False,
+ sampler=DistributedSampler(dataset)
+ )
+
+def demo_basic(rank: int):
+
+ with torch.profiler.record_function("basic 2 Layer NN"):
+ model = ToyModel()
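+        # A small bucket_cap_mb splits the gradients into more, smaller allreduce buckets, exercising the backend more often.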
+ ddp_model = DDP(model, bucket_cap_mb=4)
+ # ddp_model = DDP(model, bucket_cap_mb=4, broadcast_buffers=False)
+
+ train_set = MyTrainDataset(2048) # load your dataset
+ batch_size=64
+ train_data = prepare_dataloader(train_set, batch_size)
+
+ loss_fn = nn.MSELoss()
+ optimizer = optim.Adam(ddp_model.parameters(), lr=0.005)
+
+ max_epochs = 10
+ for epoch in range(max_epochs):
+ batch_size = len(next(iter(train_data))[0])
+ train_data.sampler.set_epoch(epoch)
+ for x, y in train_data:
+
+ optimizer.zero_grad()
+ outputs = ddp_model(x)
+ loss = loss_fn(outputs, y)
+ loss.backward()
+ optimizer.step()
+
+ print(f"Rank {rank}: Epoch {epoch} | Batchsize: {batch_size} | Steps: {len(train_data)} | Loss: {loss}")
+
+
+ print("finished training")
+ mpi.Barrier()
+ # print("final params:")
+ # print(ddp_model)
+ # dist.destroy_process_group()
+
+def start_test(comms: str, simulator: bool, host_file: str=None, fpga_file: str=None, ma: str="localhost", mp: str="30505"):
+ global rank, size
+ if ma==None:
+ ma = "localhost"
+ if mp==None:
+ mp = "30505"
+ os.environ['MASTER_ADDR'] = ma
+ os.environ['MASTER_PORT'] = mp
+
+ rank = mpi.Get_rank()
+ size = mpi.Get_size()
+ start_port = 5005
+ logger.debug(f"Starting tests with the following parameters:\n\
+Simulation: {simulator}, Communication Backend: {comms}\n\
+Rank: {rank}, World size: {size}\n\
+Host file: {host_file}, FPGA file: {fpga_file}\n\
+Master address: {ma}:{mp}, Start port for FPGA: {start_port}")
+
+
+ if not simulator:
+ #default from test.cpp
+ rxbufsize = 4096 * 1024
+ if host_file==None or fpga_file==None: sys.exit('Host and FPGA file need to be specified in hardware mode')
+
+ with open(host_file, 'r') as hf:
+ host_ips = hf.read().splitlines()
+
+ with open(fpga_file, 'r') as ff:
+ fpga_ips = ff.read().splitlines()
+
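+        # With cyt_rdma every rank uses the same start_port and passes its index as the third Rank argument; with UDP/TCP each rank gets its own port and the third argument stays 0.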
+ if comms == "cyt_rdma":
+ ranks = [accl.Rank(a, start_port, i, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+ ranks = [accl.Rank(a, start_port + i, 0, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+ # Somehow the simulator gets stuck if I use the same rxbufsize
+ rxbufsize = 4096 # * 1024
+ ranks = [accl.Rank("127.0.0.1", 5500 + i, i, rxbufsize) for i in range(size)]
+
+ logger.debug(f'Ranks: {ranks}')
+
+ if comms == 'udp':
+ design = accl.ACCLDesign.udp
+ elif comms == 'tcp':
+ design = accl.ACCLDesign.tcp
+ elif comms == 'cyt_rdma': # and not simulator:
+ design = accl.ACCLDesign.cyt_rdma
+ # else:
+ # if simulator:
+ # sys.exit('Design "' + comms + '" currently not supported in simulator mode')
+ # else:
+ # sys.exit('Design "' + comms + '" currently not supported in hardware mode')
+
+ # Sometimes ACCL gets stuck on the mpi import statement, so this is to avoid issues:
+ mpi.Barrier()
+
+
+ # dist.init_process_group("mpi", rank=rank, world_size=size)
+
+
+ accl.create_process_group(ranks, design, bufsize=rxbufsize, initialize=True, simulation=simulator)
+ dist.init_process_group("ACCL", rank=rank, world_size=size)
+
+ global num_errors
+ num_errors = 0
+
+ test_allreduce(256, torch.float32)
+ test_broadcast(256, torch.float32)
+
+ schedule = torch.profiler.schedule(
+ wait=1,
+ warmup=2,
+ active=5,
+ )
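+    # The schedule above is only used by the commented-out torch.profiler context below.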
+
+ # with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], profile_memory=True, schedule=schedule, record_shapes=True) as prof:
+
+ # generic testing
+    for n in range(9,20):
+ for i in range(40):
+ num = 2**n * 3
+ test_broadcast(num, torch.float32)
+ test_allreduce(num, torch.float32)
+ test_alltoall(num)
+ test_allgather(num, torch.float32)
+ # test_sendrcv(num)
+ # test_scatter(num)
+ test_gather(num)
+ test_reduce(num)
+
+ # prof.step()
+
+    # to simulate resnet behaviour (check to make sure it's the same as in your resnet config)
+ # for i in range(10):
+ test_resnet_sim = False
+ if test_resnet_sim:
+ test_allreduce(256, torch.int32)
+ test_allreduce(256, torch.int64)
+ test_broadcast(256, torch.float32)
+ for i in range(5):
+ test_allreduce(1000, torch.float32)
+ test_allreduce(2052096, torch.float32)
+ test_allreduce(1049600, torch.float32)
+ test_broadcast(256 * 1024, torch.float32)
+ test_allreduce(256 * 1024, torch.float32)
+ test_broadcast(53, torch.int64)
+ test_broadcast(53120, torch.float32)
+ test_broadcast(53, torch.int64)
+ test_broadcast(162, torch.int32)
+ test_broadcast(25, torch.int32)
+ test_allreduce(8196000, torch.float32)
+
+ test_NN = False
+ if test_NN:
+ demo_basic(rank)
+
+
+ mpi.Barrier()
+
+ if num_errors == 0:
+ print("======== Successfully Finished testing======")
+ logger.debug("======== Successfully Finished testing======")
+ else:
+ print(f"!!!!!!!! - {num_errors} Errors found - !!!!!!!!!")
+ logger.debug(f"!!!!!!!! - {num_errors} Errors found - !!!!!!!!!")
+
+ # print(prof.key_averages(group_by_input_shape=True)
+ # .table(sort_by="cpu_time_total", row_limit=15))
+
+
+ logger.debug('Destroying ACCL Process Group')
+ dist.destroy_process_group()
+
+if __name__ == '__main__':
+ import argparse
+ parser = argparse.ArgumentParser(description='Coyote tests for ACCL ProcessGroup')
+ parser.add_argument('-s', '--simulation', action='store_true',
+ default=False, help='Use simulation instead of '
+ 'hardware')
+ parser.add_argument('-c', '--comms', choices=['udp', 'tcp', 'cyt_rdma'], default='tcp',
+ help='Run tests over specified communication backend')
+ parser.add_argument('-i', '--host-file', type=str, help='Specify the file, where the host IPs are listed')
+ parser.add_argument('-f', '--fpga-file', type=str, help='Specify the file, where the FPGA IPs are listed')
+ parser.add_argument('-a','--master-address', type=str)
+ parser.add_argument('-p','--master-port', type=str)
+ args = parser.parse_args()
+
+ #if args.comms != 'cyt_rdma' or not args.simulation:
+ #if args.comms != 'cyt_rdma':
+ # sys.exit('Currently only supports -c cyt_rdma and -s flags')
+ start_test(args.comms, args.simulation, args.host_file, args.fpga_file, args.master_address, args.master_port)
diff --git a/integrations/pytorch_ddp/test/test-imagenet.py b/integrations/pytorch_ddp/test/test-imagenet.py
new file mode 100644
index 00000000..5e725035
--- /dev/null
+++ b/integrations/pytorch_ddp/test/test-imagenet.py
@@ -0,0 +1,299 @@
+import torch
+import torch.nn as nn
+from torch import optim
+from torch.optim import lr_scheduler
+from torchvision import datasets
+from torchvision.transforms import ToTensor
+from torch.utils.data import DataLoader
+from torch.autograd import Variable
+import torchvision
+from torchvision import datasets, models, transforms
+import torch.distributed as dist
+import accl_process_group as accl
+
+from mpi4py.MPI import COMM_WORLD as mpi
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.utils.data.distributed import DistributedSampler
+
+
+import numpy as np
+import matplotlib.pyplot as plt
+import time
+import argparse
+import os
+import sys
+import logging
+from PIL import Image
+from tempfile import TemporaryDirectory
+
+logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
+
+logger = logging.getLogger(__name__)
+
+if "ACCL_DEBUG" in os.environ and os.environ["ACCL_DEBUG"]=="1":
+ logger.setLevel(logging.DEBUG)
+else:
+ logger.setLevel(logging.WARNING)
+
+# Run via ACCL
+global best_model_params_path
+
+best_model_params_path = './best_model_params.pt'
+
+
+class CNN(nn.Module):
+ def __init__(self):
+ super(CNN, self).__init__()
+ self.conv1 = nn.Sequential(
+ nn.Conv2d(
+ in_channels=1,
+ out_channels=16,
+ kernel_size=5,
+ stride=1,
+ padding=2,
+ ),
+ nn.ReLU(),
+ nn.MaxPool2d(kernel_size=2),
+ )
+ self.conv2 = nn.Sequential(
+ nn.Conv2d(16, 32, 5, 1, 2),
+ nn.ReLU(),
+ nn.MaxPool2d(2),
+ )
+ # fully connected layer, output 10 classes
+ self.out = nn.Linear(32 * 7 * 7, 10)
+ def forward(self, x):
+ x = self.conv1(x)
+ x = self.conv2(x)
+ # flatten the output of conv2 to (batch_size, 32 * 7 * 7)
+ x = x.view(x.size(0), -1)
+ output = self.out(x)
+ return output, x # return x for visualization
+
+def train(model, criterion, optimizer, scheduler, num_epochs=25):
+ since = time.time()
+
+ global rank
+    # Note: the temporary directory is unused here; checkpoints are written to best_model_params_path instead
+ with TemporaryDirectory() as tempdir:
+
+ torch.save(model.state_dict(), best_model_params_path)
+ best_acc = 0.0
+
+ print("train len: " + str(len(dataloaders['train'])))
+ print("val len: " + str(len(dataloaders['val'])))
+
+ for epoch in range(num_epochs):
+ print(f'Epoch {epoch}/{num_epochs - 1}')
+ print('-' * 10)
+
+ # Each epoch has a training and validation phase
+ for phase in ['train', 'val']:
+ if phase == 'train':
+ model.train() # Set model to training mode
+ else:
+ model.eval() # Set model to evaluate mode
+
+ running_loss = 0.0
+ running_corrects = 0
+
+ # Iterate over data.
+ count = 0
+
+
+ for inputs, labels in dataloaders[phase]:
+ inputs = inputs.to(device)
+ labels = labels.to(device)
+
+ # zero the parameter gradients
+ optimizer.zero_grad()
+
+ # forward
+ # track history if only in train
+ with torch.set_grad_enabled(phase == 'train'):
+ outputs = model(inputs)
+ _, preds = torch.max(outputs, 1)
+ loss = criterion(outputs, labels)
+
+ # backward + optimize only if in training phase
+ if phase == 'train':
+ loss.backward()
+ optimizer.step()
+
+ # statistics
+ running_loss += loss.item() * inputs.size(0)
+ running_corrects += torch.sum(preds == labels.data)
+
+ print(f'{phase} batch Loss: {loss.item():.4f}')
+ logger.debug(f'{phase} batch Loss: {loss.item():.4f}')
+
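+                    # Checkpoint every 5 batches from rank 0 only, so only one process writes best_model_params_path.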
+ if count % 5 == 0 and rank == 0:
+ print("saving model to " + best_model_params_path)
+ torch.save(model.state_dict(), best_model_params_path)
+
+ count += 1
+
+ if phase == 'train':
+ scheduler.step()
+
+ epoch_loss = running_loss / dataset_sizes[phase]
+ epoch_acc = running_corrects.double() / dataset_sizes[phase]
+
+ print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
+ logger.debug(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
+ # deep copy the model
+ if phase == 'val' and epoch_acc > best_acc:
+ best_acc = epoch_acc
+ torch.save(model.state_dict(), best_model_params_path)
+
+ print()
+
+ time_elapsed = time.time() - since
+ print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
+    print(f'Best val Acc: {best_acc:.4f}')
+
+ # load best model weights
+ model.load_state_dict(torch.load(best_model_params_path))
+ return model
+
+
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser()
+
+ parser.add_argument("-n", type=int, default=1)
+ parser.add_argument("-d", type=bool, default=None)
+
+
+ parser.add_argument('-s', '--simulator', action='store_true',
+ default=False, help='Use simulation instead of '
+ 'hardware')
+ parser.add_argument('-c', '--comms', choices=['udp', 'tcp', 'cyt_rdma'], default='tcp',
+ help='Run tests over specified communication backend')
+ parser.add_argument('-i', '--host-file', type=str, help='Specify the file, where the host IPs are listed')
+ parser.add_argument('-f', '--fpga-file', type=str, help='Specify the file, where the FPGA IPs are listed')
+ parser.add_argument('-a','--master-address', type=str)
+ parser.add_argument('-p','--master-port', type=str)
+
+
+ args = parser.parse_args()
+
+ if args.n == 1 and args.d == None :
+ print("only one machine specified. Assuming Non distributed setup")
+ args.d = False
+ elif args.n > 1 and args.d == None:
+ print("Assung DDP setup")
+ args.d = True
+
+
+ global rank, size
+ if args.master_address==None:
+ args.master_address = "localhost"
+ if args.master_port==None:
+ args.master_port = "30505"
+ os.environ['MASTER_ADDR'] = args.master_address
+ os.environ['MASTER_PORT'] = args.master_port
+ rank = mpi.Get_rank()
+ size = mpi.Get_size()
+
+ host_file = args.host_file
+ fpga_file = args.fpga_file
+ comms = args.comms
+ start_port = 5005
+
+ rxbufsize = 4096 * 1024
+
+ if args.d:
+ if not args.simulator:
+ #default from test.cpp
+ rxbufsize = 4096 * 1024
+ if host_file==None or fpga_file==None: sys.exit('Host and FPGA file need to be specified in hardware mode')
+
+ with open(host_file, 'r') as hf:
+ host_ips = hf.read().splitlines()
+
+ with open(fpga_file, 'r') as ff:
+ fpga_ips = ff.read().splitlines()
+
+ if comms == "cyt_rdma":
+ ranks = [accl.Rank(a, start_port, i, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+ ranks = [accl.Rank(a, start_port + i, 0, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+ rxbufsize = 4096 * 1024
+ ranks = [accl.Rank("127.0.0.1", 5500 + i, i, rxbufsize) for i in range(size)]
+
+ logger.debug(f'Ranks: {ranks}')
+
+ if args.comms == 'udp':
+ design = accl.ACCLDesign.udp
+ elif args.comms == 'tcp':
+ design = accl.ACCLDesign.tcp
+ elif args.comms == 'cyt_rdma': # and not simulator:
+ design = accl.ACCLDesign.cyt_rdma
+
+
+ mpi.Barrier()
+
+ accl.create_process_group(ranks, design, bufsize=rxbufsize, initialize=True, simulation=args.simulator)
+ dist.init_process_group("ACCL", rank=rank, world_size=size)
+
+ device = 'cpu'
+
+
+ data_transforms = {
+ 'train': transforms.Compose([
+ transforms.RandomResizedCrop(224),
+ transforms.RandomHorizontalFlip(),
+ transforms.ToTensor(),
+ transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
+ ]),
+ 'val': transforms.Compose([
+ transforms.Resize(256),
+ transforms.CenterCrop(224),
+ transforms.ToTensor(),
+ transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
+ ]),
+ }
+
+ data_dir = 'imagenet-data/hymenoptera_data'
+ image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
+ data_transforms[x])
+ for x in ['train', 'val']}
+
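+    # In DDP mode each rank gets a DistributedSampler shard; otherwise sampler(x) returns None and the DataLoader falls back to its default sampling.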
+ if args.d : sampler = DistributedSampler
+ else : sampler = lambda x : None
+
+ dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
+ shuffle=False, num_workers=4, sampler=sampler(image_datasets[x]))
+ for x in ['train', 'val']}
+ dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
+ class_names = image_datasets['train'].classes
+
+ model_ft = models.resnet50(weights='IMAGENET1K_V1')
+
+ num_ftrs = model_ft.fc.in_features
+
+ model_ft.fc = nn.Linear(num_ftrs, 2)
+
+ if args.d : model_ft = DDP(model_ft, bucket_cap_mb=2)
+
+ criterion = nn.CrossEntropyLoss()
+
+ # Observe that all parameters are being optimized
+ optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)
+
+ # Decay LR by a factor of 0.1 every 7 epochs
+ exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)
+
+ loss_func = nn.CrossEntropyLoss()
+
+ best_model_params_path = './best_model_params.pt'
+
+ # model_ft.load_state_dict(torch.load(best_model_params_path))
+
+ model_ft = train(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
+ num_epochs=25)
+
+
diff --git a/integrations/pytorch_ddp/test/test-mnist.py b/integrations/pytorch_ddp/test/test-mnist.py
new file mode 100644
index 00000000..0256b017
--- /dev/null
+++ b/integrations/pytorch_ddp/test/test-mnist.py
@@ -0,0 +1,275 @@
+import torch
+from torchvision import datasets
+from torchvision.transforms import ToTensor
+from torch.utils.data import DataLoader
+from torch.profiler import profile, ProfilerActivity
+import torch.nn as nn
+from torch import optim
+from torch.autograd import Variable
+import torch.distributed as dist
+import accl_process_group as accl
+
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.utils.data.distributed import DistributedSampler
+
+import argparse
+import os
+import sys
+import logging
+import time
+
+logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
+
+logger = logging.getLogger(__name__)
+
+if "ACCL_DEBUG" in os.environ and os.environ["ACCL_DEBUG"]=="1":
+ logger.setLevel(logging.DEBUG)
+else:
+ logger.setLevel(logging.WARNING)
+
+# Run via ACCL
+
+class CNN(nn.Module):
+ def __init__(self):
+ super(CNN, self).__init__()
+ self.conv1 = nn.Sequential(
+ nn.Conv2d(
+ in_channels=1,
+ out_channels=16,
+ kernel_size=5,
+ stride=1,
+ padding=2,
+ ),
+ nn.ReLU(),
+ nn.MaxPool2d(kernel_size=2),
+ )
+ self.conv2 = nn.Sequential(
+ nn.Conv2d(16, 32, 5, 1, 2),
+ nn.ReLU(),
+ nn.MaxPool2d(2),
+ )
+ # fully connected layer, output 10 classes
+ self.out = nn.Linear(32 * 7 * 7, 10)
+ def forward(self, x):
+ x = self.conv1(x)
+ x = self.conv2(x)
+ # flatten the output of conv2 to (batch_size, 32 * 7 * 7)
+ x = x.view(x.size(0), -1)
+ output = self.out(x)
+ return output, x # return x for visualization
+
+def train(num_epochs, cnn, loaders):
+
+ start_time_train = time.perf_counter()
+
+ cnn.train()
+
+ # Train the model
+ total_step = len(loaders['train'])
+
+ optimizer = optim.Adam(cnn.parameters(), lr = 0.01)
+
+ for epoch in range(num_epochs):
+ for i, (images, labels) in enumerate(loaders['train']):
+ # p.step()
+ start_time = time.perf_counter()
+ # gives batch data, normalize x when iterate train_loader
+ b_x = Variable(images) # batch x
+ b_y = Variable(labels) # batch y
+ output = cnn(b_x)[0]
+
+ loss = loss_func(output, b_y)
+
+ # clear gradients for this training step
+ optimizer.zero_grad()
+
+ # backpropagation, compute gradients
+ loss.backward()
+ # apply gradients
+ optimizer.step()
+
+ # if (i+1) % 100 == 0:
+ if True:
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Time(us): {}'
+ .format(epoch + 1, num_epochs, i + 1, total_step, loss.item(), measured_time))
+
+ end_time_train = time.perf_counter()
+ measured_time_train = (end_time_train - start_time_train) * 1000000
+
+ print('Total train time: ' + str(measured_time_train))
+
+
+def test():
+ # Test the model
+ start_time_test = time.perf_counter()
+ cnn.eval()
+ with torch.no_grad():
+ correct = 0
+ total = 0
+ for images, labels in loaders['test']:
+ # p.step()
+ test_output, last_layer = cnn(images)
+ pred_y = torch.max(test_output, 1)[1].data.squeeze()
+ correct_current = (pred_y == labels).sum().item()
+ total += labels.size(0)
+ correct += correct_current
+
+ print(f'Test Batch accuracy: {correct_current}/{labels.size(0)} {correct_current/float(labels.size(0))}')
+
+
+ end_time_test = time.perf_counter()
+ measured_time_test = (end_time_test - start_time_test) * 1000000
+
+ print('Total test time: ' + str(measured_time_test))
+ print(f'Total accuracy: {correct}/{total} {correct/float(total)}')
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser()
+
+ parser.add_argument("-n", type=int, default=1)
+ parser.add_argument("-d", type=bool, default=None)
+
+
+ parser.add_argument('-s', '--simulator', action='store_true',
+ default=False, help='Use simulation instead of '
+ 'hardware')
+ parser.add_argument('-c', '--comms', choices=['udp', 'tcp', 'cyt_rdma'], default='tcp',
+ help='Run tests over specified communication backend')
+ parser.add_argument('-i', '--host-file', type=str, help='Specify the file, where the host IPs are listed')
+ parser.add_argument('-f', '--fpga-file', type=str, help='Specify the file, where the FPGA IPs are listed')
+ parser.add_argument('-a','--master-address', type=str)
+ parser.add_argument('-p','--master-port', type=str)
+
+
+ args = parser.parse_args()
+
+ if args.n == 1 and args.d == None :
+ print("only one machine specified. Assuming Non distributed setup")
+ args.d = False
+ elif args.n > 1 and args.d == None:
+ print("Assung DDP setup")
+ args.d = True
+
+
+ host_file = args.host_file
+ fpga_file = args.fpga_file
+ comms = args.comms
+ start_port = 5005
+
+ global rank, size
+ if args.master_address==None:
+ args.master_address = "localhost"
+ if args.master_port==None:
+ args.master_port = "30505"
+ os.environ['MASTER_ADDR'] = args.master_address
+ os.environ['MASTER_PORT'] = args.master_port
+
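+    # This script runs over the stock MPI backend; the ACCL process-group setup further down is left commented out.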
+ dist.init_process_group("mpi")
+ rank = dist.get_rank()
+ size = dist.get_world_size()
+
+
+ rxbufsize = 4096 * 1024
+
+ if args.d:
+ if not args.simulator:
+ #default from test.cpp
+ rxbufsize = 4096 * 1024
+ if host_file==None or fpga_file==None: sys.exit('Host and FPGA file need to be specified in hardware mode')
+
+ with open(host_file, 'r') as hf:
+ host_ips = hf.read().splitlines()
+
+ with open(fpga_file, 'r') as ff:
+ fpga_ips = ff.read().splitlines()
+
+ if comms == "cyt_rdma":
+ ranks = [accl.Rank(a, start_port, i, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+ ranks = [accl.Rank(a, start_port + i, 0, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+        # Somehow the simulator gets stuck if I use the same rxbufsize (note: unlike test.py, the size is not reduced here)
+ rxbufsize = 4096 * 1024
+ ranks = [accl.Rank("127.0.0.1", 5500 + i, i, rxbufsize) for i in range(size)]
+
+ logger.debug(f'Ranks: {ranks}')
+
+ if args.comms == 'udp':
+ design = accl.ACCLDesign.udp
+ elif args.comms == 'tcp':
+ design = accl.ACCLDesign.tcp
+ elif args.comms == 'cyt_rdma': # and not simulator:
+ design = accl.ACCLDesign.cyt_rdma
+
+
+
+ # dist.init_process_group("mpi", rank=rank, world_size=size)
+
+ # accl.create_process_group(ranks, design, bufsize=rxbufsize, initialize=True, simulation=args.simulator)
+ # dist.init_process_group("ACCL", rank=rank, world_size=size)
+
+ device = 'cpu'
+
+ train_data = datasets.MNIST(
+ root = 'data',
+ train = True,
+ transform = ToTensor(),
+ download = True,
+ )
+ test_data = datasets.MNIST(
+ root = 'data',
+ train = False,
+ transform = ToTensor()
+ )
+
+ if args.d : sampler = DistributedSampler
+ else : sampler = lambda x : None
+
+ loaders = {
+ 'train' : torch.utils.data.DataLoader(train_data,
+ batch_size=100,
+ shuffle=False,
+ sampler=sampler(train_data)),
+ 'test' : torch.utils.data.DataLoader(test_data,
+ batch_size=100,
+ shuffle=False,
+ sampler=sampler(test_data)),
+ }
+
+ cnn = CNN()
+ if args.d : cnn = DDP(cnn, bucket_cap_mb=2)
+
+ loss_func = nn.CrossEntropyLoss()
+
+ num_epochs = 10
+
+ print("starting training")
+
+ print(rank)
+ print(size)
+
+ schedule = torch.profiler.schedule(
+ wait=1,
+ warmup=1,
+ active=10,
+ repeat=3
+ )
+
+ # with torch.profiler.profile(
+ # activities=[torch.profiler.ProfilerActivity.CPU],
+ # schedule=schedule,
+ # on_trace_ready=torch.profiler.tensorboard_trace_handler('./accl_log/profiler_log'),
+ # record_shapes=True,
+ # with_stack=True
+ # ) as p:
+
+
+ train(num_epochs, cnn, loaders)
+
+ test()
+
+
+ dist.destroy_process_group()
diff --git a/integrations/pytorch_ddp/test/test-resnet34.py b/integrations/pytorch_ddp/test/test-resnet34.py
new file mode 100644
index 00000000..fa42963f
--- /dev/null
+++ b/integrations/pytorch_ddp/test/test-resnet34.py
@@ -0,0 +1,253 @@
+import torch
+import torchvision
+from torchvision import datasets
+from torchvision import models
+from torchvision import transforms
+from torchvision.transforms import ToTensor
+from torch.utils.data import DataLoader
+from torch.profiler import profile, ProfilerActivity
+import torch.nn as nn
+from torch import optim
+from torch.autograd import Variable
+import torch.distributed as dist
+import accl_process_group as accl
+
+from mpi4py.MPI import COMM_WORLD as mpi
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.utils.data.distributed import DistributedSampler
+
+import argparse
+import os
+import sys
+import logging
+import time
+
+logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
+
+logger = logging.getLogger(__name__)
+
+if "ACCL_DEBUG" in os.environ and os.environ["ACCL_DEBUG"]=="1":
+ logger.setLevel(logging.DEBUG)
+else:
+ logger.setLevel(logging.WARNING)
+
+# Run via ACCL
+
+def train(num_epochs, model, loaders, criterion, p):
+
+ start_time_train = time.perf_counter()
+
+ model.train()
+
+ total_step = len(loaders['train'])
+
+ optimizer = optim.Adam(model.parameters(), lr = 0.001)
+
+ for epoch in range(num_epochs):
+ model.train()
+ running_loss = 0.0
+ for i, (inputs, labels) in enumerate(loaders['train']):
+ p.step()
+ start_time = time.perf_counter()
+
+ optimizer.zero_grad()
+ outputs = model(inputs)
+ loss = criterion(outputs, labels)
+ loss.backward()
+ optimizer.step()
+ running_loss += loss.item()
+
+ if (i+1) % 100 == 0:
+ break
+ if True:
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Time(us): {}'
+ .format(epoch + 1, num_epochs, i + 1, total_step, loss.item(), measured_time))
+
+ end_time_train = time.perf_counter()
+ measured_time_train = (end_time_train - start_time_train) * 1000000
+
+ print('Total train time: ' + str(measured_time_train))
+
+
+def test(num_epochs, model, loaders, criterion, p):
+ # Test the model
+ start_time_test = time.perf_counter()
+ model.eval()
+ with torch.no_grad():
+ correct = 0
+ total = 0
+ val_loss = 0
+ for i, (inputs, labels) in enumerate(loaders['test']):
+ p.step()
+ test_output = model(inputs)
+ loss = criterion(test_output, labels)
+ val_loss += loss.item()
+
+ _, predicted = torch.max(test_output, 1)
+ correct_current = (predicted == labels).sum().item()
+ total += labels.size(0)
+ correct += correct_current
+
+ print(f'Test Batch accuracy: {correct_current}/{labels.size(0)} {correct_current/float(labels.size(0))}')
+
+
+ end_time_test = time.perf_counter()
+ measured_time_test = (end_time_test - start_time_test) * 1000000
+
+ print('Total test time: ' + str(measured_time_test))
+ print(f'Total accuracy: {correct}/{total} {correct/float(total)}')
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser()
+
+ parser.add_argument("-n", type=int, default=1)
+ parser.add_argument("-d", type=bool, default=None)
+
+
+ parser.add_argument('-s', '--simulator', action='store_true',
+ default=False, help='Use simulation instead of '
+ 'hardware')
+ parser.add_argument('-c', '--comms', choices=['udp', 'tcp', 'cyt_rdma'], default='tcp',
+ help='Run tests over specified communication backend')
+ parser.add_argument('-i', '--host-file', type=str, help='Specify the file, where the host IPs are listed')
+ parser.add_argument('-f', '--fpga-file', type=str, help='Specify the file, where the FPGA IPs are listed')
+ parser.add_argument('-a','--master-address', type=str)
+ parser.add_argument('-p','--master-port', type=str)
+
+
+ args = parser.parse_args()
+
+ if args.n == 1 and args.d == None :
+ print("only one machine specified. Assuming Non distributed setup")
+ args.d = False
+ elif args.n > 1 and args.d == None:
+ print("Assuming DDP setup")
+ args.d = True
+
+
+ host_file = args.host_file
+ fpga_file = args.fpga_file
+ comms = args.comms
+ start_port = 5005
+
+ global rank, size
+ if args.master_address==None:
+ args.master_address = "localhost"
+ if args.master_port==None:
+ args.master_port = "30505"
+ os.environ['MASTER_ADDR'] = args.master_address
+ os.environ['MASTER_PORT'] = args.master_port
+ rank = mpi.Get_rank()
+ size = mpi.Get_size()
+
+ rxbufsize = 4096 * 1024
+
+ if args.d:
+ if not args.simulator:
+ #default from test.cpp
+ rxbufsize = 4096 * 1024
+ if host_file==None or fpga_file==None: sys.exit('Host and FPGA file need to be specified in hardware mode')
+
+ with open(host_file, 'r') as hf:
+ host_ips = hf.read().splitlines()
+
+ with open(fpga_file, 'r') as ff:
+ fpga_ips = ff.read().splitlines()
+
+ if comms == "cyt_rdma":
+ ranks = [accl.Rank(a, start_port, i, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+ ranks = [accl.Rank(a, start_port + i, 0, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+        # Somehow the simulator gets stuck if I use the same rxbufsize (note: unlike test.py, the size is not reduced here)
+ rxbufsize = 4096 * 1024
+ ranks = [accl.Rank("127.0.0.1", 5500 + i, i, rxbufsize) for i in range(size)]
+
+ logger.debug(f'Ranks: {ranks}')
+
+ if args.comms == 'udp':
+ design = accl.ACCLDesign.udp
+ elif args.comms == 'tcp':
+ design = accl.ACCLDesign.tcp
+ elif args.comms == 'cyt_rdma': # and not simulator:
+ design = accl.ACCLDesign.cyt_rdma
+
+
+ mpi.Barrier()
+
+ accl.create_process_group(ranks, design, bufsize=rxbufsize, initialize=True, simulation=args.simulator)
+ dist.init_process_group("ACCL", rank=rank, world_size=size)
+
+ device = 'cpu'
+
+ transform = transforms.Compose([
+ transforms.Resize(256),
+ transforms.CenterCrop(224),
+ transforms.ToTensor(),
+ transforms.Normalize(
+ mean=[0.485, 0.456, 0.406],
+ std=[0.229, 0.224, 0.225]
+ )
+ ])
+
+ train_dataset = datasets.CIFAR10(root='cifar10_data', train=True, download=True, transform=transform)
+ val_dataset = datasets.CIFAR10(root='cifar10_data', train=False, download=True, transform=transform)
+
+ if args.d : sampler = DistributedSampler
+ else : sampler = lambda x : None
+
+ loaders = {
+ 'train' : torch.utils.data.DataLoader(train_dataset,
+ batch_size=32,
+ shuffle=False,
+ num_workers=4,
+ sampler=sampler(train_dataset)),
+ 'test' : torch.utils.data.DataLoader(val_dataset,
+ batch_size=32,
+ shuffle=False,
+ num_workers=4,
+ sampler=sampler(val_dataset)),
+ }
+
+ model = models.resnet34(pretrained=True)
+
+ if args.d : model = DDP(model, bucket_cap_mb=2, broadcast_buffers=True, find_unused_parameters=True)
+
+ loss_func = nn.CrossEntropyLoss()
+
+ criterion = nn.CrossEntropyLoss()
+
+ num_epochs = 1
+
+ mpi.Barrier()
+
+ print("starting training")
+
+ schedule = torch.profiler.schedule(
+ wait=1,
+ warmup=1,
+ active=10,
+ repeat=3
+ )
+
+
+ with torch.profiler.profile(
+ activities=[torch.profiler.ProfilerActivity.CPU],
+ schedule=schedule,
+ on_trace_ready=torch.profiler.tensorboard_trace_handler('./accl_log/profiler_log'),
+ record_shapes=True,
+ ) as p:
+
+
+ train(num_epochs, model, loaders, criterion, p)
+
+ test(num_epochs, model, loaders, criterion, p)
+
+ p.stop()
+
+ print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=100))
+
+ if args.d : dist.destroy_process_group()
diff --git a/integrations/pytorch_ddp/test/test-resnet50.py b/integrations/pytorch_ddp/test/test-resnet50.py
new file mode 100644
index 00000000..efb4daac
--- /dev/null
+++ b/integrations/pytorch_ddp/test/test-resnet50.py
@@ -0,0 +1,329 @@
+import torch
+import torchvision
+from torchvision import datasets
+from torchvision import models
+from torchvision import transforms
+from torchvision.transforms import ToTensor
+from torch.utils.data import DataLoader
+from torch.profiler import profile, ProfilerActivity
+import torch.nn as nn
+from torch import optim
+from torch.autograd import Variable
+import torch.distributed as dist
+import accl_process_group as accl
+
+from mpi4py.MPI import COMM_WORLD as mpi
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.utils.data.distributed import DistributedSampler
+
+import argparse
+import numpy as np
+import os
+import sys
+import logging
+import time
+
+seed = 43
+torch.manual_seed(seed)
+
+logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
+
+logger = logging.getLogger(__name__)
+
+if "ACCL_DEBUG" in os.environ and os.environ["ACCL_DEBUG"]=="1":
+ logger.setLevel(logging.DEBUG)
+else:
+ logger.setLevel(logging.WARNING)
+
+# Run via ACCL
+
+def train(num_epochs, model, loaders, criterion):
+
+ start_time_train = time.perf_counter()
+
+ model.train()
+
+ total_step = len(loaders['train'])
+
+ optimizer = optim.Adam(model.parameters(), lr = 0.001)
+
+ for epoch in range(num_epochs):
+ model.train()
+ running_loss = 0.0
+ for i, (inputs, labels) in enumerate(loaders['train']):
+ start_time = time.perf_counter()
+
+ optimizer.zero_grad()
+ outputs = model(inputs)
+ loss = criterion(outputs, labels)
+ loss.backward()
+ optimizer.step()
+ running_loss += loss.item()
+
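+            # Stop after 100 batches per epoch, presumably to keep the benchmark run short.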
+ if (i+1) % 100 == 0:
+ break
+ if True:
+ end_time = time.perf_counter()
+ measured_time = (end_time - start_time) * 1000000
+ print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Time(us): {}'
+ .format(epoch + 1, num_epochs, i + 1, total_step, loss.item(), measured_time))
+
+ end_time_train = time.perf_counter()
+ measured_time_train = (end_time_train - start_time_train) * 1000000
+
+ print('Total train time: ' + str(measured_time_train))
+
+
+def test(num_epochs, model, loaders, criterion):
+ # Test the model
+ start_time_test = time.perf_counter()
+ model.eval()
+ with torch.no_grad():
+ correct = 0
+ total = 0
+ val_loss = 0
+ for i, (inputs, labels) in enumerate(loaders['test']):
+ test_output = model(inputs)
+ loss = criterion(test_output, labels)
+ val_loss += loss.item()
+
+ _, predicted = torch.max(test_output, 1)
+ correct_current = (predicted == labels).sum().item()
+ total += labels.size(0)
+ correct += correct_current
+
+ print(f'Test Batch accuracy: {correct_current}/{labels.size(0)} {correct_current/float(labels.size(0))}')
+
+
+ end_time_test = time.perf_counter()
+ measured_time_test = (end_time_test - start_time_test) * 1000000
+
+ print('Total test time: ' + str(measured_time_test))
+ print(f'Total accuracy: {correct}/{total} {correct/float(total)}')
+
+
+def test_allreduce(numel, testtype):
+
+ shape = (numel,)
+
+
+ if testtype == torch.int64 or testtype == torch.int32:
+        rand_torch = torch.randint(torch.iinfo(testtype).min//size, torch.iinfo(testtype).max//size,shape, dtype=testtype)
+ else:
+ rand_torch = torch.rand(shape, dtype=testtype)
+
+ # for i in range(10):
+ if True:
+
+ # shape = (320001,)
+ x = rand_torch.clone()
+
+ dist.all_reduce(x, dist.ReduceOp.SUM)
+ mpi.Barrier()
+
+ try:
+ np.testing.assert_allclose(x, rand_torch * size)
+ except AssertionError as e:
+ logger.debug("Test AllReduce failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test AllReduce finished!")
+
+def test_broadcast(numel, testtype):
+ shape = (numel,)
+
+ # testtype = torch.float32
+ if testtype == torch.int64 or testtype == torch.int32:
+ rand_torch = torch.randint(torch.iinfo(testtype).min, torch.iinfo(testtype).max,shape, dtype=testtype)
+ # rand_torch = torch.ones(shape, dtype=testtype)
+ else:
+ rand_torch = torch.rand(shape, dtype=testtype)
+
+ # for i in range(10):
+ if True:
+
+ if rank == 1:
+ x = rand_torch.clone()
+ else:
+ x = torch.ones(shape, dtype=testtype)
+
+ dist.broadcast(x, 1)
+
+ mpi.Barrier()
+
+ # logger.debug('Tensor after broadcast: ' + str(x))
+ # print('Tensor after broadcast: ' + str(x))
+ try:
+ np.testing.assert_allclose(x, rand_torch)
+ except AssertionError as e:
+ logger.debug("Test Broadcast failed")
+ logger.debug(str(e))
+ else:
+ logger.debug("Test broadcast finished!")
+
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser()
+
+ parser.add_argument("-n", type=int, default=1)
+ parser.add_argument("-d", type=bool, default=None)
+
+
+ parser.add_argument('-s', '--simulator', action='store_true',
+ default=False, help='Use simulation instead of '
+ 'hardware')
+ parser.add_argument('-c', '--comms', choices=['udp', 'tcp', 'cyt_rdma'], default='tcp',
+ help='Run tests over specified communication backend')
+ parser.add_argument('-i', '--host-file', type=str, help='Specify the file, where the host IPs are listed')
+ parser.add_argument('-f', '--fpga-file', type=str, help='Specify the file, where the FPGA IPs are listed')
+ parser.add_argument('-a','--master-address', type=str)
+ parser.add_argument('-p','--master-port', type=str)
+
+
+ args = parser.parse_args()
+
+ if args.n == 1 and args.d == None :
+ print("only one machine specified. Assuming Non distributed setup")
+ args.d = False
+ elif args.n > 1 and args.d == None:
+ print("Assuming DDP setup")
+ args.d = True
+
+
+ host_file = args.host_file
+ fpga_file = args.fpga_file
+ comms = args.comms
+ start_port = 5005
+
+ global rank, size
+ if args.master_address==None:
+ args.master_address = "localhost"
+ if args.master_port==None:
+ args.master_port = "30505"
+ os.environ['MASTER_ADDR'] = args.master_address
+ os.environ['MASTER_PORT'] = args.master_port
+ rank = mpi.Get_rank()
+ size = mpi.Get_size()
+
+ rxbufsize = 4096 * 1024
+
+ if args.d:
+ if not args.simulator:
+ #default from test.cpp
+ rxbufsize = 4096 * 1024
+ if host_file==None or fpga_file==None: sys.exit('Host and FPGA file need to be specified in hardware mode')
+
+ with open(host_file, 'r') as hf:
+ host_ips = hf.read().splitlines()
+
+ with open(fpga_file, 'r') as ff:
+ fpga_ips = ff.read().splitlines()
+
+ if comms == "cyt_rdma":
+ ranks = [accl.Rank(a, start_port, i, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+ ranks = [accl.Rank(a, start_port + i, 0, rxbufsize) for i, a in enumerate(fpga_ips)]
+ else:
+        # Somehow the simulator gets stuck if I use the same rxbufsize (note: unlike test.py, the size is not reduced here)
+ rxbufsize = 4096 * 1024
+ ranks = [accl.Rank("127.0.0.1", 5500 + i, i, rxbufsize) for i in range(size)]
+
+ logger.debug(f'Ranks: {ranks}')
+
+ if args.comms == 'udp':
+ design = accl.ACCLDesign.udp
+ elif args.comms == 'tcp':
+ design = accl.ACCLDesign.tcp
+ elif args.comms == 'cyt_rdma': # and not simulator:
+ design = accl.ACCLDesign.cyt_rdma
+
+
+ mpi.Barrier()
+
+ accl.create_process_group(ranks, design, bufsize=rxbufsize, initialize=True, simulation=args.simulator)
+ dist.init_process_group("ACCL", rank=rank, world_size=size)
+
+ # dist.init_process_group("mpi", rank=rank, world_size=size)
+
+
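+    # Small collective smoke tests before kicking off DDP training.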
+ test_allreduce(256, torch.float32)
+ test_broadcast(256, torch.float32)
+
+ test_broadcast(162, torch.int32)
+ # if args.d : dist.destroy_process_group()
+
+ # sys.exit(0)
+
+ device = 'cpu'
+
+ transform = transforms.Compose([
+ transforms.Resize(256),
+ transforms.CenterCrop(224),
+ transforms.ToTensor(),
+ transforms.Normalize(
+ mean=[0.485, 0.456, 0.406],
+ std=[0.229, 0.224, 0.225]
+ )
+ ])
+
+ train_dataset = datasets.CIFAR10(root='cifar10_data', train=True, download=True, transform=transform)
+ val_dataset = datasets.CIFAR10(root='cifar10_data', train=False, download=True, transform=transform)
+
+ if args.d : sampler = DistributedSampler
+ else : sampler = lambda x : None
+
+ loaders = {
+ 'train' : torch.utils.data.DataLoader(train_dataset,
+ batch_size=32,
+ shuffle=False,
+ num_workers=4,
+ sampler=sampler(train_dataset)),
+ 'test' : torch.utils.data.DataLoader(val_dataset,
+ batch_size=32,
+ shuffle=False,
+ num_workers=4,
+ sampler=sampler(val_dataset)),
+ }
+
+ model = models.resnet50(pretrained=True)
+
+ if args.d : model = DDP(model, bucket_cap_mb=2, broadcast_buffers=True, find_unused_parameters=True)
+
+ loss_func = nn.CrossEntropyLoss()
+
+ criterion = nn.CrossEntropyLoss()
+
+ num_epochs = 1
+
+ mpi.Barrier()
+
+ print("starting training")
+
+ schedule = torch.profiler.schedule(
+ wait=1,
+ warmup=1,
+ active=10,
+ repeat=3
+ )
+
+
+ # with torch.profiler.profile(
+ # activities=[torch.profiler.ProfilerActivity.CPU],
+ # schedule=schedule,
+ # on_trace_ready=torch.profiler.tensorboard_trace_handler('./accl_log/profiler_log'),
+ # record_shapes=True,
+ # ) as p:
+
+ if True:
+
+
+ train(num_epochs, model, loaders, criterion)
+
+ # test(num_epochs, model, loaders, criterion)
+
+ # p.stop()
+
+ # print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=100))
+
+ if args.d : dist.destroy_process_group()