[ROCm] Add thread safety test for hipblaslt GetAlgorithms #609

Draft
i-chaochen wants to merge 2 commits into rocm-jaxlib-v0.8.0 from chao/hipblaslt_data_race

i-chaochen (Collaborator) commented Feb 1, 2026

Add unit tests to detect potential data races in hipblaslt's MasterSolutionLibrary and ContractionSolution when multiple threads concurrently call GetAlgorithms.

Motivation

This test is designed to help catch race conditions that can cause segfaults in multi-threaded JAX workloads using BF16 GEMM operations, where:

  • One thread loads solution libraries (loadLibrary)
  • Another thread calls getSolutionByIndex concurrently
  • Mutable state like autoGSU is modified without synchronization
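The pattern in the bullets above can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `SolutionLibrary`, `Solution`, `LoadLibrary`, `GetSolutionByIndex` and `auto_gsu` are stand-ins that mirror the names in the description, not hipblaslt's actual classes or fields. The sketch shows one way such a loader/reader pair could be synchronized (a reader-writer lock plus an atomic field), assuming C++17; it is not a description of how hipblaslt is or will be fixed.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>
#include <vector>

// Hypothetical solution entry. `auto_gsu` mimics mutable tuning state that
// several threads may update; making it atomic avoids a data race on it.
struct Solution {
  int index = 0;
  std::atomic<int> auto_gsu{1};
};

// Hypothetical library that is filled lazily while other threads read it.
class SolutionLibrary {
 public:
  // Writer path: appends `count` solutions under an exclusive lock. Without
  // the lock, a concurrent reader could index the vector while it reallocates.
  void LoadLibrary(int count) {
    std::unique_lock<std::shared_mutex> lock(mu_);
    for (int i = 0; i < count; ++i) {
      auto s = std::make_shared<Solution>();
      s->index = static_cast<int>(solutions_.size());
      solutions_.push_back(std::move(s));
    }
  }

  // Reader path: shared lock, so many readers can proceed in parallel.
  // Returns nullptr when the index has not been loaded.
  std::shared_ptr<Solution> GetSolutionByIndex(int index) const {
    std::shared_lock<std::shared_mutex> lock(mu_);
    if (index < 0 || index >= static_cast<int>(solutions_.size())) {
      return nullptr;
    }
    return solutions_[index];
  }

 private:
  mutable std::shared_mutex mu_;
  std::vector<std::shared_ptr<Solution>> solutions_;
};
```

Returning a `std::shared_ptr` rather than a reference keeps the entry alive for the caller even if the library grows after the lock is released.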

Technical Details

The test exercises:

  1. Concurrent GetAlgorithms calls on the same plan
  2. Concurrent access with different problem sizes (triggers lazy loading)
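The two scenarios listed above can be sketched as a generic thread harness. This is a minimal illustration, not the code in hip_blas_lt_thread_test.cc: `FakePlan`, its `GetAlgorithms(m, n)` signature and `RunConcurrentGetAlgorithms` are hypothetical stand-ins, and no XLA or hipblaslt API is called. All threads spin on a shared flag so they begin calling at roughly the same moment, which is what gives a race the best chance of showing up under a tool such as ThreadSanitizer.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a matmul plan. The first query for a problem
// size "lazily loads" an entry; the mutex is what makes that step safe.
class FakePlan {
 public:
  std::vector<int> GetAlgorithms(int m, int n) {
    std::lock_guard<std::mutex> lock(mu_);
    const std::pair<int, int> key{m, n};
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      // Lazy load: made-up algorithm ids derived from the problem size.
      it = cache_.emplace(key, std::vector<int>{m / 64, n / 64}).first;
    }
    return it->second;
  }

 private:
  std::mutex mu_;
  std::map<std::pair<int, int>, std::vector<int>> cache_;
};

// Starts `num_threads` threads together and has each call GetAlgorithms
// `iters` times. With `vary_sizes` set, threads query different problem
// sizes so the lazy-load path runs concurrently with lookups.
// Returns how many calls produced a non-empty algorithm list.
int RunConcurrentGetAlgorithms(FakePlan& plan, int num_threads, int iters,
                               bool vary_sizes) {
  std::atomic<bool> go{false};
  std::atomic<int> ok{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      // Spin until released so all threads hit the plan at once.
      while (!go.load(std::memory_order_acquire)) std::this_thread::yield();
      for (int i = 0; i < iters; ++i) {
        const int m = vary_sizes ? 64 * (1 + (t + i) % 4) : 128;
        if (!plan.GetAlgorithms(m, 128).empty()) ok.fetch_add(1);
      }
    });
  }
  go.store(true, std::memory_order_release);
  for (auto& th : threads) th.join();
  return ok.load();
}
```

Calling `RunConcurrentGetAlgorithms(plan, 8, 100, false)` corresponds to scenario 1 (same plan, same size) and passing `true` corresponds to scenario 2 (different sizes).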

[ROCm] Add multi-stream GEMM execution tests for hipblaslt synchronizer buffer race

Add two new test cases to hip_blas_lt_thread_test.cc to exercise concurrent
GEMM execution paths that may trigger the hipblaslt synchronizer buffer race:

  • MultiStreamGemmExecutionRace: Creates multiple streams with a shared BlasLt
    plan and executes concurrent GEMM operations, verifying result correctness.

  • FireAndForgetGemmRace: More aggressive test with 8 streams and rapid GEMM
    launches to maximize GPU kernel overlap.

These tests allocate device memory, execute actual hipblasLtMatmul operations
via XLA's BlasLt interface, and verify output correctness. While they may not
reliably reproduce the GSU synchronizer buffer race, they serve as:

  • Stress tests for concurrent GEMM execution
  • Regression tests after hipblaslt fixes are applied
  • Documentation of the known multi-stream race condition pattern
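The shape of these multi-stream tests can be sketched on the CPU, with threads standing in for streams. This is an analogy under stated assumptions, not the test code: `GemmPlan`, `RunSplitKGemm` and `RunMultiStream` are hypothetical, no HIP or hipblasLtMatmul call is made, and treating GSU as a split along K with partial results held in a scratch buffer is an assumption about what the synchronizer buffer does. Each worker gets its own scratch buffer here; sharing one buffer across workers is the kind of hazard the tests are meant to stress.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical read-only plan shared by all workers (row-major matrices).
struct GemmPlan {
  int m, n, k;
};

// Computes C = A * B by splitting K into two halves, writing each partial
// product into `scratch`, then reducing the two partials into `c`.
void RunSplitKGemm(const GemmPlan& p, const std::vector<float>& a,
                   const std::vector<float>& b, std::vector<float>& scratch,
                   std::vector<float>& c) {
  const std::size_t mn = static_cast<std::size_t>(p.m) * p.n;
  const int half = p.k / 2;
  scratch.assign(2 * mn, 0.0f);
  for (int part = 0; part < 2; ++part) {
    const int k0 = part == 0 ? 0 : half;
    const int k1 = part == 0 ? half : p.k;
    for (int i = 0; i < p.m; ++i)
      for (int j = 0; j < p.n; ++j)
        for (int kk = k0; kk < k1; ++kk)
          scratch[part * mn + i * p.n + j] +=
              a[i * p.k + kk] * b[kk * p.n + j];
  }
  c.assign(mn, 0.0f);
  for (std::size_t idx = 0; idx < mn; ++idx) {
    c[idx] = scratch[idx] + scratch[mn + idx];
  }
}

// Runs `num_streams` workers that share one plan and the same inputs but
// own their scratch and output buffers, then verifies every output.
bool RunMultiStream(int num_streams, int iters) {
  const GemmPlan plan{8, 8, 16};
  const std::vector<float> a(plan.m * plan.k, 1.0f);
  const std::vector<float> b(plan.k * plan.n, 2.0f);
  std::vector<std::vector<float>> outputs(num_streams);
  std::vector<std::vector<float>> scratch(num_streams);
  std::vector<std::thread> workers;
  for (int s = 0; s < num_streams; ++s) {
    workers.emplace_back([&, s] {
      for (int i = 0; i < iters; ++i) {
        RunSplitKGemm(plan, a, b, scratch[s], outputs[s]);
      }
    });
  }
  for (auto& w : workers) w.join();

  // All-ones A times all-twos B: every element of C equals 2 * k.
  const float expected = 2.0f * plan.k;
  const std::size_t mn = static_cast<std::size_t>(plan.m) * plan.n;
  for (const auto& out : outputs) {
    if (out.size() != mn) return false;
    for (float v : out) {
      if (v != expected) return false;
    }
  }
  return true;
}
```

`RunMultiStream(8, 50)` loosely mirrors the more aggressive variant with 8 streams and repeated launches; the correctness check at the end plays the role of the output verification the tests perform.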

i-chaochen force-pushed the chao/hipblaslt_data_race branch from 88e394c to 0b5bdc1 on February 1, 2026 17:10

[ROCm] Add multi-stream GEMM execution tests for hipblaslt synchronizer buffer race

Reproducing the GSU synchronizer buffer race requires specific algorithm selection like GSUAMBSK, so the two new tests may not trigger it reliably.

Also adds scratch_allocator dependency to BUILD file.

i-chaochen force-pushed the chao/hipblaslt_data_race branch from bc34256 to d29ad35 on February 2, 2026 00:59
i-chaochen marked this pull request as draft on February 23, 2026 14:20
Labels

help wanted, rocm-jaxlib-v0.8.0
