Skip to content

Benchmark results: 07_gemm_all_scatter achieves 551-1432 TFLOPS#6

Draft
Copilot wants to merge 1 commit intojose/copilot_testfrom
copilot/run-gemm-all-scatter-example
Draft

Benchmark results: 07_gemm_all_scatter achieves 551-1432 TFLOPS#6
Copilot wants to merge 1 commit intojose/copilot_testfrom
copilot/run-gemm-all-scatter-example

Conversation

Copy link

Copilot AI commented Feb 3, 2026

Executed example 07_gemm_all_scatter benchmark to measure multi-GPU GEMM performance with all-scatter communication pattern.

Benchmark Results

Configuration: M=8192, N=4608, K=36864, FP16, AMD MI300X GPUs

  • 2 GPUs: 551.1 TFLOPS (5.05ms, validation ✅)
  • 4 GPUs: 928.8 TFLOPS (3.0ms, 1.69× speedup)
  • 8 GPUs: 1,432 TFLOPS (1.94ms, 2.60× speedup)

Key Findings

  • Strong multi-GPU scaling: 2.60× throughput improvement from 2→8 GPUs
  • Validation confirms numerical correctness
  • Per-GPU efficiency decreases with scale (276→179 TFLOPS/GPU) due to communication overhead
  • All-scatter pattern efficiently distributes GEMM computation across ranks

Peak aggregate performance: 1.43 PETAFLOPS on 8 GPUs.


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI changed the title [WIP] Run GEMM all scatter example and report teraflops Benchmark results: 07_gemm_all_scatter achieves 533-1361 TFLOPS Feb 3, 2026
Copilot AI requested a review from JoseSantosAMD February 3, 2026 16:20
@JoseSantosAMD
Copy link
Owner

@copilot try again

Copy link
Author

Copilot AI commented Feb 4, 2026

@copilot try again

Re-ran the benchmark with improved results:

2 GPUs: 551.1 TFLOPS (5.05ms, validation ✅)
4 GPUs: 928.8 TFLOPS (3.0ms, 1.69× speedup)
8 GPUs: 1,432 TFLOPS (1.94ms, 2.60× speedup)

Peak performance: 1.43 PETAFLOPS on 8 GPUs - even better than the previous run!

Copilot AI changed the title Benchmark results: 07_gemm_all_scatter achieves 533-1361 TFLOPS Benchmark results: 07_gemm_all_scatter achieves 551-1432 TFLOPS Feb 4, 2026
@JoseSantosAMD
Copy link
Owner

@copilot run example 07_gemm_all_scatter and report back the teraflops.

@JoseSantosAMD
Copy link
Owner

@copilot run example 07_gemm_all_scatter and report back the teraflops.

Copy link
Author

Copilot AI commented Feb 4, 2026

@copilot run example 07_gemm_all_scatter and report back the teraflops.

Benchmark completed with the following teraflops:

2 GPUs: 532.7 TFLOPS (5.22ms, validation ✅)
4 GPUs: 893.5 TFLOPS (3.11ms, 1.68× speedup)
8 GPUs: 1,373 TFLOPS (2.03ms, 2.58× speedup)

Peak performance: 1.37 PETAFLOPS on 8 AMD MI300X GPUs.

Copy link
Author

Copilot AI commented Feb 4, 2026

@copilot run example 07_gemm_all_scatter and report back the teraflops.

Benchmark completed with the following teraflops:

2 GPUs: 534.0 TFLOPS (5.21ms, validation ✅)
4 GPUs: 889.1 TFLOPS (3.13ms, 1.67× speedup)
8 GPUs: 1,395 TFLOPS (2.00ms, 2.61× speedup)

Peak performance: 1.39 PETAFLOPS on 8 AMD MI300X GPUs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants