
Releases: ByteDance-Seed/Triton-distributed

Enhance overlap kernels

12 Sep 08:49
2112ad7


Pre-release

Add Ulysses SP kernels and improve EP kernels.

v0.0.1-rc

20 Aug 01:44
5365b98


Pre-release

Compiled with

  • Triton v3.4
  • NVSHMEM v3.3.9

What's Changed

  • feat: support mega kernel in #93 by @XG-zheng
  • feat: support E2E MoE models like Qwen/Qwen3-235B-A22B in #85 by @houqi @XG-zheng @KnowingNothing @wenlei-bao @preminstrel
  • feat: support GEMM+AllReduce on Hopper
  • feat: support GroupedGEMM+ReduceScatter on L20/Ampere
  • feat: use NVLS ld_reduce with .acc::f32 accumulation by default for BF16/FP16 reductions, for better precision (see the precision sketch after this list)
  • fix: support NVLS multimem.st in a vectorized way
  • fix: fix a hang problem with cooperative_launch_grids (closes #81)
  • fix: fix bugs in AG+GroupedGEMM that could cause unexpected memory accesses
  • opt: reduce AllReduce one-shot latency to 9us on H800x8 for very small messages (closes #57; the one-shot/two-shot schedules are sketched after this list)
  • opt: improve AllReduce two-shot latency by returning the symmetric buffer directly, saving a device-to-device copy
  • opt: the AllReduce DoubleTree implementation is much faster but still not production-ready; better pipelining is needed
  • trivial: support compiling without the CUDA toolkit and torch
  • Enable rocSHMEM host API usage by @drprajap in #68
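
The ".acc::f32" item above refers to accumulating BF16/FP16 values in FP32 inside the NVLS load-reduce. The following standalone PyTorch sketch (not Triton-distributed code; all names are local to the example) shows why accumulating in FP32 and rounding once at the end is more accurate than accumulating in BF16 step by step:

```python
import torch

torch.manual_seed(0)
world_size = 8
shards = [torch.randn(1 << 20) for _ in range(world_size)]   # per-rank FP32 data
bf16_shards = [s.to(torch.bfloat16) for s in shards]

# Reference: FP32 reduction of the original FP32 inputs.
ref = torch.stack(shards).sum(dim=0)

# BF16 accumulation: every partial sum is rounded back to BF16.
acc_bf16 = bf16_shards[0].clone()
for s in bf16_shards[1:]:
    acc_bf16 = acc_bf16 + s                 # result stays BF16 at each step

# FP32 accumulation (the ".acc::f32" behavior): add in FP32, round once at the end.
acc_f32 = torch.zeros(1 << 20)
for s in bf16_shards:
    acc_f32 += s.float()
acc_f32 = acc_f32.to(torch.bfloat16)

print("bf16-accumulate mean abs error:", (acc_bf16.float() - ref).abs().mean().item())
print("fp32-accumulate mean abs error:", (acc_f32.float() - ref).abs().mean().item())
```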
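For reference on the one-shot vs. two-shot wording in the latency items: one-shot has every rank read all peers' buffers and reduce locally in a single communication phase, while two-shot does a reduce-scatter followed by an all-gather. A minimal single-process sketch with ranks modeled as a list of tensors (illustrative only, not the library's kernels):

```python
import torch

world_size = 4
chunk = 6
inputs = [torch.randn(world_size * chunk) for _ in range(world_size)]

# One-shot: each rank reads every peer's full buffer and reduces locally.
# One communication phase; attractive when latency dominates (very small messages).
one_shot = [torch.stack(inputs).sum(dim=0) for _ in range(world_size)]

# Two-shot: reduce-scatter, then all-gather.
# Phase 1 (reduce-scatter): rank r reduces only its own slice across all peers.
reduced_slices = [
    sum(inp[r * chunk:(r + 1) * chunk] for inp in inputs)
    for r in range(world_size)
]
# Phase 2 (all-gather): every rank collects the reduced slices from all peers.
two_shot = [torch.cat(reduced_slices) for _ in range(world_size)]

# Both schedules produce the same result on every rank.
assert all(torch.allclose(one_shot[r], two_shot[r]) for r in range(world_size))
```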

Known Issue

  • AMD support is not included in the wheels; if you want to try AMD, build from source.

Full Changelog: experimental...v0.0.1-rc

v0.0.1

11 Jul 10:19
22b20cf


Pre-release

Environments

  • container tag: nvcr.io/nvidia/pytorch:25.04-py3
  • Triton v3.4
  • NVSHMEM4py v3.3.9
  • PyTorch 2.4+ without dynamo (torch.compile is not supported)
  • CUDA 12+
  • Python 3.12+
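
A minimal sanity-check sketch for this environment; the version floors come from the list above, and only standard upstream modules (sys, torch, triton) are used, not Triton-distributed APIs:

```python
import sys
import torch
import triton

# Python 3.12+
assert sys.version_info >= (3, 12), "Python 3.12+ expected"

# PyTorch 2.4+ (strip any local suffix like "+cu124" before parsing)
torch_major, torch_minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (torch_major, torch_minor) >= (2, 4), "PyTorch 2.4+ expected"

# CUDA 12+ (torch.version.cuda is None on CPU-only builds)
assert torch.version.cuda is not None and int(torch.version.cuda.split(".")[0]) >= 12, "CUDA 12+ expected"

print("Triton:", triton.__version__)                       # expected: 3.4.x
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
```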