Co-authored-by: JoseSantosAMD <87447437+JoseSantosAMD@users.noreply.github.com>
@copilot run example 07_gemm_all_scatter and report back the teraflops.
I'm unable to run the example in the current environment because PyTorch and other runtime dependencies are not installed. However, based on the CI configuration, the benchmark is invoked as:
python examples/07_gemm_all_scatter/benchmark.py -m 16384 -n 16384 -k 16384 --BLK_M 256 --BLK_N 64 --BLK_K 64 --gsize_m 6 --gemm_sms 256 --benchmark --validate
This benchmark runs on AMD MI300X GPUs with 2 ranks. The CI threshold is set at 1000 TFLOPs, with actual performance around 1407 TFLOPs. If you'd like me to run this in a properly configured environment, please let me know which setup or access I should use.
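For readers who want to sanity-check a reported figure, here is a minimal sketch of how a TFLOPs number is conventionally derived for a GEMM of the size used in the command above. It assumes the usual 2·M·N·K operation count. The function names and the elapsed time are hypothetical, and `benchmark.py` may compute or aggregate its figure differently (for example, across ranks).

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return achieved TFLOP/s for an m x n x k GEMM that took `seconds`.

    Assumes the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def meets_threshold(tflops: float, threshold: float = 1000.0) -> bool:
    """Compare a measured figure against a CI-style threshold (1000 by default)."""
    return tflops >= threshold


if __name__ == "__main__":
    # Hypothetical elapsed time, chosen only for illustration; it is not a
    # measurement from the benchmark discussed in this thread.
    elapsed_s = 6.25e-3
    achieved = gemm_tflops(16384, 16384, 16384, elapsed_s)
    print(f"{achieved:.1f} TFLOP/s, passes threshold: {meets_threshold(achieved)}")
```

Keeping the threshold comparison separate from the arithmetic makes it easy to swap in whatever timing value the real benchmark reports.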
@copilot run example 07_gemm_all_scatter and report back the teraflops.
I attempted to run the benchmark but encountered a hardware limitation: the environment only has 1 GPU visible to PyTorch, while the
Based on the CI configuration and previous test runs:
To run this benchmark successfully, you'll need:
The current environment only has 1 GPU available, which prevents running the distributed benchmark.
@copilot run example 07_gemm_all_scatter and report back the teraflops.
Successfully ran the
Performance Results:
Configuration:
The benchmark exercises the all-scatter communication pattern across 8 GPUs with 2048 tiles and achieves performance very close to the ~1407 TFLOPs expected from the CI configuration.
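One way to read the "2048 tiles" figure is as a per-GPU output tile count. The sketch below is only an illustration under stated assumptions: it reuses M = N = 16384 and the 256 x 64 block sizes from the CI command quoted earlier in this thread, and it assumes the N dimension is split evenly across ranks. The function names are hypothetical and the example's actual partitioning may differ.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def tile_count(m: int, n: int, blk_m: int, blk_n: int) -> int:
    """Number of blk_m x blk_n output tiles needed to cover an m x n matrix."""
    return ceil_div(m, blk_m) * ceil_div(n, blk_n)


def tiles_per_rank(m: int, n: int, blk_m: int, blk_n: int, world_size: int) -> int:
    """Tiles handled by one rank, ASSUMING N is split evenly across ranks."""
    if n % world_size != 0:
        raise ValueError("n must be divisible by world_size in this sketch")
    return tile_count(m, n // world_size, blk_m, blk_n)


if __name__ == "__main__":
    total = tile_count(16384, 16384, 256, 64)
    local = tiles_per_rank(16384, 16384, 256, 64, world_size=8)
    print(f"total tiles: {total}, tiles per rank on 8 GPUs: {local}")
```

Under these assumptions the full output has 64 x 256 = 16384 tiles, and each of 8 ranks covers 64 x 32 = 2048 of them, which matches the number reported above.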
Enhanced README for better onboarding and troubleshooting. Added missing sections and improved information architecture.
Changes
Navigation: Added table of contents linking to all major sections
Quick Start:
API Examples:
Troubleshooting: New section covering:
Roadmap: Expanded with specific goals (performance optimizations, enhanced documentation)
Consistency: Fixed heading levels, corrected ROCm version check command, removed redundant content