forked from facebookresearch/baspacho
Context
CUDA LU solve is 3.7ms vs cuDSS 0.6ms (6x gap) on c6288 (25380x25380 circuit Jacobian). Factor is already at cuDSS parity (2.85ms vs 2.5ms).
Current solve architecture:
- Separate kernel dispatches for sparse-elim forward L / backward U phases
- Per-lump CPU iteration for 16 dense lumps
- Three `flush()` barriers (permutation → L solve → U solve)
- Multiple cuBLAS calls for dense GEMV/TRSV
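To make the dispatch overhead concrete, here is a minimal host-side model of the current solve structure that just counts launches and barriers. All names are hypothetical stand-ins for the issue's description (the per-lump call count of one TRSV plus one GEMV per phase is an assumption); the three `flush()` barriers and 16 dense lumps are from the issue.

```cpp
#include <cassert>

// Host-side model of the current dispatch structure. Every "kernel" counted
// below is a separate kernel launch or cuBLAS dispatch in the current code,
// which is where the per-call overhead accumulates.
struct DispatchCount {
    int kernels = 0;   // separate kernel / cuBLAS dispatches
    int barriers = 0;  // flush() synchronization points
};

inline DispatchCount modelCurrentSolve(int denseLumps) {
    DispatchCount c;
    c.kernels += 1;                   // apply permutation
    c.barriers += 1;                  // flush() #1
    c.kernels += 1;                   // sparse-elim forward L phase
    c.kernels += 2 * denseLumps;      // per-lump TRSV + GEMV, L phase (assumed)
    c.barriers += 1;                  // flush() #2 between L and U
    c.kernels += 2 * denseLumps;      // per-lump TRSV + GEMV, U phase (assumed)
    c.kernels += 1;                   // sparse-elim backward U phase
    c.barriers += 1;                  // flush() #3
    return c;
}
```

Under these assumptions, 16 dense lumps already mean dozens of separate dispatches per solve; the persistent-kernel proposal below collapses all of them into one launch with zero host-side barriers.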
Proposed approach
Implement a persistent kernel solve using cuBLASDx (device-side BLAS), following the cuDSS architecture from Sparse Days 2024:
- Single kernel launch for the entire triangular solve (forward L + backward U)
- Inter-CTA synchronization via `atomicAdd` on `done[]` counters — thread blocks spin-wait until dependencies are satisfied, then immediately process their supernode
- cuBLASDx for device-side GEMV/TRSV — no separate cuBLAS dispatch overhead
- Level-set parallelism — independent supernodes at the same tree level processed by different thread blocks simultaneously
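The `done[]`-counter scheme can be sketched as a host-side C++ model, with `std::thread` standing in for resident CTAs and `std::atomic` for the device counters. This is illustrative only: the real version is a single persistent CUDA kernel, and the spin-wait is safe there only under NVIDIA's forward-progress guarantee for resident blocks. All names (`DepGraph`, `solveModel`) are hypothetical; supernodes are assumed to be indexed in topological (level-set) order, so each supernode only depends on lower indices.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// deps[i] = supernodes that must be solved before supernode i (all < i,
// i.e. supernodes are numbered in level-set / topological order).
struct DepGraph {
    std::vector<std::vector<int>> deps;
};

void solveModel(const DepGraph& g, std::vector<int>& order) {
    const int n = (int)g.deps.size();
    std::vector<std::atomic<int>> done(n);   // done[i] == 1 once i is solved
    for (auto& d : done) d.store(0);
    std::atomic<int> next{0};                // work counter: CTAs claim supernodes
    std::mutex mu;
    auto worker = [&] {
        for (;;) {
            int i = next.fetch_add(1);       // mimics a CTA claiming its supernode
            if (i >= n) return;
            for (int d : g.deps[i])          // spin-wait on dependencies, as the
                while (!done[d].load(std::memory_order_acquire)) {}  // CTAs would
            {   // "solve" supernode i (stand-in for the TRSV + GEMV work)
                std::lock_guard<std::mutex> lk(mu);
                order.push_back(i);
            }
            done[i].store(1, std::memory_order_release);  // publish completion
        }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```

The acquire/release pairing on `done[]` mirrors what the device code would need (a system-scope release store, or a `__threadfence()` before the `atomicAdd`) so that a waiting block observes the producer's writes to the RHS, not just the counter flip.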
Key requirements
- cuBLASDx — compile-time template library for device-side BLAS
- NVIDIA forward-progress guarantee for resident thread blocks
- Pre-computed dependency graph (already available via `LevelSetSchedule`)
- Shared memory sizing for per-supernode TRSV/GEMV working storage
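For the shared-memory sizing requirement, a small budget helper makes the constraint explicit. The staging layout here is an assumption, not baspacho's actual one: each CTA is assumed to stage the n×n diagonal block for the TRSV, the length-n RHS segment, and a length-m scratch vector for the GEMV update against the m below-diagonal rows, all in double precision.

```cpp
#include <cstddef>

// Hypothetical per-supernode shared memory budget (assumed staging layout).
constexpr std::size_t supernodeSmemBytes(std::size_t n, std::size_t m) {
    return sizeof(double) * (n * n   // diagonal block (TRSV operand)
                             + n     // RHS segment being solved
                             + m);   // GEMV scratch for below-diagonal rows
}

// A supernode only fits in the default 48 KiB per-block shared memory limit
// if the budget is within it; larger limits are opt-in on recent GPUs.
// Oversized supernodes would need splitting or global-memory operands.
constexpr bool fitsDefaultSmem(std::size_t n, std::size_t m) {
    return supernodeSmemBytes(n, m) <= 48 * 1024;
}
```

Because the budget grows with n², the diagonal-block dimension, not the number of below-diagonal rows, is what bounds the per-CTA size class.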
Expected improvement
- Solve: 3.7ms → <1ms (eliminate all kernel launch + cuBLAS dispatch overhead)
- Total LU: 6.5ms → ~4ms (approaching cuDSS's 3.1ms)
Notes
- This is CUDA-only; Metal lacks forward-progress guarantees and device-side BLAS
- The existing modular solve (Solver.cpp `internalSolveLRangeUnit`/`internalSolveURange`) should remain as a fallback for non-CUDA backends
- cuBLASDx requires specifying matrix sizes at compile time via templates — may need a few size specializations or runtime dispatch
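The runtime-dispatch option in the last note can be sketched as a map from a runtime lump size to the nearest pre-instantiated compile-time specialization. The size classes 16/32/64 here are assumptions for illustration, not values from the issue; padding a lump up to the next class wastes some FLOPs but keeps the number of template instantiations small.

```cpp
#include <stdexcept>

// Sketch only: in the real code each specialization would hold cuBLASDx
// TRSV/GEMV descriptor types instantiated with its compile-time size N,
// and this function body would launch (or be) the corresponding kernel.
template <int N>
void solveLumpKernel(/* device pointers elided */) {}

// Pick the smallest size class that covers the runtime lump size.
// Returns the class used, so the caller knows the padding applied.
inline int dispatchLumpSolve(int lumpSize) {
    if (lumpSize <= 16) { solveLumpKernel<16>(); return 16; }
    if (lumpSize <= 32) { solveLumpKernel<32>(); return 32; }
    if (lumpSize <= 64) { solveLumpKernel<64>(); return 64; }
    throw std::runtime_error("lump size exceeds largest specialization");
}
```

With only 16 dense lumps in the c6288 case, a handful of classes chosen from the actual lump-size distribution should cover the workload without a combinatorial template blow-up.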