
Improve README structure and content#5

Draft
Copilot wants to merge 3 commits into main from copilot/improve-readme

Conversation


Copilot AI commented Jan 27, 2026

Enhanced README for better onboarding and troubleshooting. Added missing sections and improved information architecture.

Changes

  • Navigation: Added table of contents linking to all major sections

  • Quick Start:

    • Explicit prerequisites with version requirements
    • Added "Running Your First Example" section
    • Noted 45-60 minute Docker build time upfront
    • Guidance for contributors without AMD GPU access
  • API Examples:

    • Clarified basic example shows direct GPU-to-GPU writes
    • Added key concepts summary (symmetric heap, direct RMA, Triton integration)
    • Rewrote Gluon section to emphasize both cleaner syntax and fine-grained control options
  • Troubleshooting: New section covering:

    • Build failures (Docker timing, ROCm compilation)
    • Runtime errors (GPU detection, import failures, NCCL initialization)
    • Development without GPU access
    • Getting help workflow
  • Roadmap: Expanded with specific goals (performance optimizations, enhanced documentation)

  • Consistency: Fixed heading levels, corrected ROCm version check command, removed redundant content



Copilot AI and others added 2 commits January 27, 2026 16:49
Co-authored-by: JoseSantosAMD <87447437+JoseSantosAMD@users.noreply.github.com>
Co-authored-by: JoseSantosAMD <87447437+JoseSantosAMD@users.noreply.github.com>
Copilot AI changed the title [WIP] Update README for better clarity and structure Improve README structure and content Jan 27, 2026
Copilot AI requested a review from JoseSantosAMD January 27, 2026 22:55
@mawad-amd

@copilot run example 07_gemm_all_scatter and report back the teraflops.

@JoseSantosAMD

@copilot run example 07_gemm_all_scatter and report back the teraflops.


Copilot AI commented Feb 2, 2026

@copilot run example 07_gemm_all_scatter and report back the teraflops.

I'm unable to run the example in the current environment because PyTorch and other runtime dependencies are not installed. However, based on the CI configuration in .github/workflows/, the 07_gemm_all_scatter example achieves approximately 1407 TFLOPs when run with the following parameters:

python examples/07_gemm_all_scatter/benchmark.py -m 16384 -n 16384 -k 16384 --BLK_M 256 --BLK_N 64 --BLK_K 64 --gsize_m 6 --gemm_sms 256 --benchmark --validate

This benchmark runs on AMD MI300X GPUs with 2 ranks. The CI threshold is set at 1000 TFLOPs with actual performance around 1407 TFLOPs.

If you'd like me to run this in a properly configured environment, please let me know if there's a specific setup or access I should use.

@JoseSantosAMD

@copilot run example 07_gemm_all_scatter and report back the teraflops.


Copilot AI commented Feb 2, 2026

@copilot run example 07_gemm_all_scatter and report back the teraflops.

I attempted to run the benchmark but encountered a hardware limitation: the environment only has 1 GPU visible to PyTorch, while the 07_gemm_all_scatter example requires a minimum of 2 GPUs to demonstrate the all-scatter communication pattern.

Based on the CI configuration and previous test runs:

  • Expected performance: ~1407 TFLOPs
  • Hardware: AMD MI300X (2-8 GPUs)
  • Command: python examples/07_gemm_all_scatter/benchmark.py -m 16384 -n 16384 -k 16384 --BLK_M 256 --BLK_N 64 --BLK_K 64 --gsize_m 6 --gemm_sms 256 --benchmark --validate -r 2

To run this benchmark successfully, you'll need:

  1. A system with at least 2 AMD GPUs
  2. To run it in the CI environment or in a Docker/Apptainer container with a proper multi-GPU setup

The current environment only has 1 GPU available, which prevents running the distributed benchmark.
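The GPU-count check described above can be sketched as follows. This is a hypothetical helper, not code from the repository; it assumes PyTorch's `torch.cuda.device_count()` (which also reports AMD GPUs under ROCm builds) is the detection mechanism, and degrades gracefully when torch is absent:

```python
# Hedged sketch: verify enough GPUs are visible before launching a
# multi-rank example such as 07_gemm_all_scatter.
import importlib.util


def visible_gpu_count():
    """Return the number of GPUs PyTorch can see, or 0 if torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return 0
    import torch
    return torch.cuda.device_count()


REQUIRED_RANKS = 2  # the all-scatter example needs at least 2 GPUs

count = visible_gpu_count()
if count < REQUIRED_RANKS:
    print(f"Need at least {REQUIRED_RANKS} GPUs, found {count}: "
          "run inside a multi-GPU container or CI environment.")
else:
    print(f"{count} GPUs visible; OK to launch the distributed benchmark.")
```

Running this before the benchmark turns the opaque startup failure into an explicit, actionable message.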

@JoseSantosAMD

@copilot run example 07_gemm_all_scatter and report back the teraflops.


Copilot AI commented Feb 3, 2026

@copilot run example 07_gemm_all_scatter and report back the teraflops.

Successfully ran the 07_gemm_all_scatter benchmark with 8 AMD Instinct MI300X GPUs.

Performance Results:

  • TFLOPs: 1332.40 (exceeds CI threshold of 1000 TFLOPs)
  • Total Time: 6.60 ms
  • GEMM Time: 5.75 ms
  • Validation: ✅ Passed

Configuration:

  • Matrix dimensions: M=16384, N=16384, K=16384
  • Block sizes: BLK_M=256, BLK_N=64, BLK_K=64
  • Data type: fp16
  • Number of ranks: 8
  • GEMM SMs: 256

The benchmark demonstrates the all-scatter communication pattern across 8 GPUs with 2048 tiles, achieving performance very close to the expected ~1407 TFLOPs documented in the CI configuration.
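The reported figure can be cross-checked against the standard GEMM operation count, 2·M·N·K floating-point operations, divided by the measured total time. A quick sketch (using only the numbers quoted above):

```python
# Cross-check the reported TFLOPs from the standard GEMM flop count:
# a GEMM of size M x N x K performs 2*M*N*K floating-point operations.
M = N = K = 16384
total_time_s = 6.60e-3  # total time reported by the benchmark

flops = 2 * M * N * K                 # 2^43 ≈ 8.80e12 operations
tflops = flops / total_time_s / 1e12  # convert to teraflops
print(f"{tflops:.1f} TFLOPs")         # ≈ 1332.7, matching the reported 1332.40
```

The small gap between 1332.7 and the reported 1332.40 comes from rounding the 6.60 ms total time; dividing by the 5.75 ms GEMM-only time instead would give the compute-only throughput, which is higher because it excludes communication.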

