TACOS is a topology-aware collective algorithm synthesizer:
- TACOS receives a network topology description and target collective patterns.
- Then, TACOS autonomously analyzes provided inputs and synthesizes topology-aware collective algorithms.
The figure below summarizes the TACOS framework:

TACOS currently supports:
- Network topology: point-to-point (direct-connect) only. Networks can be asymmetric and heterogeneous.
  - Switches should be unwound into point-to-point connections.
- Target collective pattern: All-Gather
  - Reduce-Scatter and All-Reduce can be supported by TACOS, but their implementations are still in progress.
  - All-to-All is not supported in TACOS.
- Output: TACOS currently reports the estimated collective time of the synthesized collective algorithm.
  - MSCCL-XML generation is in progress so that the synthesized TACOS algorithm can run on real systems via MSCCL (see MSCCLang Paper).
Please find more information about the framework in the TACOS paper [IEEExplorer] [arXiv].
- You can cite the paper (BibTeX) by clicking the Cite this repository button (on the right side toolbar below the About tab).
For any questions about TACOS, please contact Will Won or Tushar Krishna. You can also search for or create new GitHub issues.
We sincerely appreciate your contribution to the TACOS project! Please see CONTRIBUTING.md for contribution guidelines.
TACOS is a C++17-based project, using CMake as the build manager.
- g++ or clang
- cmake >= 3.26
TACOS is also distributed as a Docker image, which you can pull from Docker Hub.
docker pull astrasim/tacos:latest
docker run -it astrasim/tacos:latest
Alternatively, you can build the Docker image locally from this repository.
./utils/build_docker_image.sh
./utils/start_docker_container.sh
You can compile TACOS using the provided script (tacos.sh):
./tacos.sh configure # Running CMake to configure the build
./tacos.sh build # Compiles TACOS
A successful build produces TACOS as a static library (libtacos.a) together with an example executable built from src/main.cpp:
build
├── bin
│   └── tacos
└── lib
    └── libtacos.a
You can run the compiled example binary (build/bin/tacos) either directly or via the provided script.
./tacos.sh run
TACOS also comes with a small set of simple regression tests (inside the tests/ directory). You can compile and run these tests via the script.
./tacos.sh configure --with-tests # Enables Debug mode and tests/ compilation
./tacos.sh build
./tacos.sh test # Runs ctest for regression tests
Because this builds TACOS in Debug mode (which is significantly slower), we recommend recompiling TACOS without the --with-tests option afterwards to re-enable compiler optimizations.
Network topologies are declared inside include/tacos/topology and defined inside src/topology directories.
To declare a new network topology, create a new header file containing a class that inherits from the base Topology class. You can add any member variables you need, but a constructor alone is often sufficient. For example, see the include/tacos/topology/mesh_2d.h declaration:
#pragma once
#include <tacos/topology/topology.h>
namespace tacos {
class Mesh2D final : public Topology {
public:
    Mesh2D(int width, int height, Bandwidth bandwidth, Latency latency) noexcept;
};
} // namespace tacos
Then, define the topology (i.e., its constructor) inside the src/topology directory. The important APIs for this are:
- setNpusCount_(npusCount)
  - Set the number of NPUs (i.e., endpoints) inside the topology.
- connect_(src, dest, bandwidth, latency, bidirectional)
  - Create a connection between src -> dest.
  - This connection's bandwidth and latency are provided in GiB/s and microseconds (us), respectively.
  - Note that this API constructs a unidirectional connection. You may set bidirectional=true to automatically construct the dest -> src connectivity with the same bandwidth and latency numbers.
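The difference between a unidirectional and a bidirectional connection can be illustrated with a small stand-in. The sketch below does not use the TACOS library: `LinkRecorder`, `Link`, and `makeRing` are hypothetical names introduced only for this illustration, and the recorder merely mimics the behavior described above for `connect_` (one `src -> dest` link, plus the reverse link when `bidirectional` is true).

```cpp
#include <cassert>
#include <vector>

// Stand-in for illustration only; this is NOT the TACOS Topology class.
struct Link {
    int src;
    int dest;
    double bandwidth;  // GiB/s
    double latency;    // microseconds (us)
};

class LinkRecorder {
public:
    void setNpusCount(int npusCount) { npusCount_ = npusCount; }

    // Records src -> dest; when bidirectional is true, also records dest -> src
    // with the same bandwidth and latency.
    void connect(int src, int dest, double bandwidth, double latency,
                 bool bidirectional = false) {
        assert(src >= 0 && src < npusCount_);
        assert(dest >= 0 && dest < npusCount_);
        links_.push_back({src, dest, bandwidth, latency});
        if (bidirectional) {
            links_.push_back({dest, src, bandwidth, latency});
        }
    }

    int npusCount() const { return npusCount_; }
    const std::vector<Link>& links() const { return links_; }

private:
    int npusCount_ = 0;
    std::vector<Link> links_;
};

// Hypothetical helper: builds a bidirectional ring over npusCount endpoints.
LinkRecorder makeRing(int npusCount, double bandwidth, double latency) {
    assert(npusCount >= 3);  // smaller rings would create duplicate links
    LinkRecorder ring;
    ring.setNpusCount(npusCount);
    for (int src = 0; src < npusCount; ++src) {
        const int dest = (src + 1) % npusCount;
        ring.connect(src, dest, bandwidth, latency, true);
    }
    return ring;
}
```

With this stand-in, `makeRing(4, 50.0, 0.5)` records 8 unidirectional links: 4 ring edges, each expanded into both directions.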
For example, the implementation of width x height 2D Mesh (src/topology/mesh_2d.cpp):
#include <tacos/topology/mesh_2d.h>
using namespace tacos;
Mesh2D::Mesh2D(const int width,
               const int height,
               const Bandwidth bandwidth,
               const Latency latency) noexcept : Topology() {
    setNpusCount_(width * height); // number of NPUs = width * height
    // connect x-axis wise
    for (auto row = 0; row < height; ++row) {
        for (auto col = 0; col < (width - 1); ++col) {
            const auto src = (row * width) + col;
            const auto dest = src + 1;
            connect_(src, dest, bandwidth, latency, true); // bidirectional connection
        }
    }
    // connect y-axis wise
    for (auto row = 0; row < (height - 1); ++row) {
        for (auto col = 0; col < width; ++col) {
            const auto src = (row * width) + col;
            const auto dest = src + width;
            connect_(src, dest, bandwidth, latency, true); // bidirectional connection
        }
    }
}
Finally, make sure to list the newly added files inside the src/CMakeLists.txt file.
add_library(tacos
...
topology/mesh_2d.cpp ${CMAKE_SOURCE_DIR}/include/tacos/topology/mesh_2d.h
...
)
Currently, TACOS supports only the All-Gather collective pattern; Reduce-Scatter and All-Reduce implementations are in progress. You can see the signature of the All-Gather pattern inside include/tacos/collective/all_gather.h:
class AllGather final : public Collective {
public:
    AllGather(int npusCount, int collectivesCount = 1) noexcept;
};
- npusCount: Number of NPUs of the target topology.
  - This can easily be retrieved from the target topology itself via topology.npusCount().
- collectivesCount: Number of initial (input) chunks per NPU.
  - For example, if npusCount=4 and collectivesCount=3, each NPU will start with 3 chunks (input buffer) and end up with 12 chunks (output buffer).
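The chunk counts in the example above follow directly from the All-Gather semantics: every NPU ends up holding every chunk. The standalone sketch below models this with sets of chunk IDs. It does not use the TACOS library; the function names and the chunk-numbering scheme are assumptions made for this illustration.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Chunk IDs held by one NPU.
using ChunkSet = std::set<int>;

// Initial state (assumed numbering): NPU i owns chunk IDs
// [i * collectivesCount, (i + 1) * collectivesCount).
std::vector<ChunkSet> initialChunks(int npusCount, int collectivesCount) {
    assert(npusCount > 0 && collectivesCount > 0);
    std::vector<ChunkSet> buffers(npusCount);
    for (int npu = 0; npu < npusCount; ++npu) {
        for (int c = 0; c < collectivesCount; ++c) {
            buffers[npu].insert(npu * collectivesCount + c);
        }
    }
    return buffers;
}

// Final state after an All-Gather: every NPU holds the union of all chunks.
std::vector<ChunkSet> gatheredChunks(const std::vector<ChunkSet>& initial) {
    ChunkSet all;
    for (const ChunkSet& buffer : initial) {
        all.insert(buffer.begin(), buffer.end());
    }
    return std::vector<ChunkSet>(initial.size(), all);
}
```

For `npusCount=4` and `collectivesCount=3`, each entry of `initialChunks(4, 3)` has 3 chunk IDs, and each entry of the gathered result has 12.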
The TACOS synthesizer is instantiated by calling its constructor with no arguments:
#include <tacos/synthesizer/synthesizer.h>
using namespace tacos;
int main() {
    auto synthesizer = Synthesizer();
}
The synthesizer provides a solve(topology, collective, chunkSize) -> time method that synthesizes the target collective algorithm.
- topology is the target network topology object.
- collective is the target collective communication pattern (for now, an All-Gather pattern).
- chunkSize is the size of each chunk, in bytes.
  - For example, recall the All-Gather with npusCount=4 and collectivesCount=3. If the chunkSize is 1,048,576 (1 MiB), the input buffer size of this All-Gather is 3 MiB, and the output buffer size is 12 MiB.
  - In other words, if you know the input buffer size, chunkSize can be deduced via (chunk size) = (input buffer size) / (collectivesCount).
  - Likewise, from a given output buffer size, chunkSize can be deduced via (chunk size) = (output buffer size) / (collectivesCount * npusCount).
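The two chunk-size relations above can be written as small helper functions. This is a minimal standalone sketch: `chunkSizeFromInput` and `chunkSizeFromOutput` are hypothetical names and are not part of the TACOS API. Note that integer division truncates when the buffer size is not an exact multiple of the divisor.

```cpp
#include <cassert>
#include <cstdint>

// (chunk size) = (input buffer size) / (collectivesCount)
std::uint64_t chunkSizeFromInput(std::uint64_t inputBufferBytes,
                                 int collectivesCount) {
    assert(collectivesCount > 0);
    return inputBufferBytes / static_cast<std::uint64_t>(collectivesCount);
}

// (chunk size) = (output buffer size) / (collectivesCount * npusCount)
std::uint64_t chunkSizeFromOutput(std::uint64_t outputBufferBytes,
                                  int npusCount,
                                  int collectivesCount) {
    assert(npusCount > 0 && collectivesCount > 0);
    const std::uint64_t chunksPerOutput =
        static_cast<std::uint64_t>(npusCount) *
        static_cast<std::uint64_t>(collectivesCount);
    return outputBufferBytes / chunksPerOutput;
}
```

With `npusCount=4` and `collectivesCount=3`, a 3 MiB input buffer and a 12 MiB output buffer both yield a chunk size of 1,048,576 bytes.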
solve(topology, collective, chunkSize) -> time returns the estimated collective time of the synthesized collective algorithm, in microseconds (us).
- TACOS is currently being upgraded to also generate an MSCCL-XML representation, which is a concise representation that holds the actual collective algorithm, not just the estimated collective time.
src/main.cpp implements an example TACOS run by instantiating a Mesh2D topology and an All-Gather collective, as below:
#include <iostream>
#include <tacos/collective/all_gather.h>
#include <tacos/synthesizer/synthesizer.h>
#include <tacos/topology/mesh_2d.h>
using namespace tacos;
int main() {
    // create topology
    const auto topology = Mesh2D(4, 3, 50, 0.5); // 4x3 (12-NPU) Mesh, each link: 50 GiB/s & 0.5 us
    const auto npusCount = topology.npusCount();
    // create collective
    const Collective::ChunkSize outputBufferSize = 12 * (1 << 20); // 12 MiB
    const auto collectivesCount = 3; // 3 initial chunks per each NPU
    const auto collective = AllGather(npusCount, collectivesCount);
    // chunk size = output buffer size / (npusCount * collectivesCount)
    const auto chunkSize = outputBufferSize / (npusCount * collectivesCount);
    // run synthesizer
    auto synthesizer = Synthesizer();
    auto collectiveTime = synthesizer.solve(topology, collective, chunkSize); // TACOS API call
    std::cout << "Collective Time: " << collectiveTime << " us" << std::endl;
}
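To get a feel for how the link units (GiB/s and us) relate to a time in microseconds, the sketch below converts a single chunk transfer over one link into microseconds. It assumes a simple latency-plus-serialization cost model, which this README does not specify; `linkTimeUs` is a hypothetical helper, not a TACOS function, and the result is not the collective time that TACOS reports.

```cpp
#include <cassert>

constexpr double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;

// Assumed cost model: time = latency + size / bandwidth.
// chunkBytes in bytes, bandwidthGiBps in GiB/s, latencyUs in microseconds.
// Returns the transfer time in microseconds.
double linkTimeUs(double chunkBytes, double bandwidthGiBps, double latencyUs) {
    assert(chunkBytes >= 0.0);
    assert(bandwidthGiBps > 0.0);
    const double bytesPerUs = bandwidthGiBps * kBytesPerGiB / 1e6;
    return latencyUs + chunkBytes / bytesPerUs;
}
```

Under this assumption, sending a 1 MiB chunk over a 50 GiB/s, 0.5 us link takes about 20.03 us (19.53 us of serialization plus 0.5 us of latency).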