
CUDA solve: persistent kernel with cuBLASDx for device-side triangular solve #5

@robtaylor

Context

CUDA LU solve takes 3.7ms vs 0.6ms for cuDSS (roughly a 6x gap) on c6288 (a 25380x25380 circuit Jacobian). Factorization is already close to cuDSS parity (2.85ms vs 2.5ms).

Current solve architecture:

  • Separate kernel dispatches for sparse-elim forward L / backward U phases
  • Per-lump CPU iteration for 16 dense lumps
  • Three flush() barriers (permutation → L solve → U solve)
  • Multiple cuBLAS calls for dense GEMV/TRSV

Proposed approach

Implement a persistent kernel solve using cuBLASDx (device-side BLAS), following the cuDSS architecture from Sparse Days 2024:

  1. Single kernel launch for the entire triangular solve (forward L + backward U)
  2. Inter-CTA synchronization via atomicAdd on done[] counters — thread blocks spin-wait until dependencies are satisfied, then immediately process their supernode
  3. cuBLASDx for device-side GEMV/TRSV — no separate cuBLAS dispatch overhead
  4. Level-set parallelism — independent supernodes at the same tree level processed by different thread blocks simultaneously
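The done[] protocol in step 2 can be sketched on the host. The following is a minimal single-threaded emulation, not the proposed kernel: each "virtual block" polls done[] for the dependencies of its next supernode, processes it once they are satisfied, and then publishes its own completion with an atomic increment. The `Node` struct, the round-robin ownership of supernodes, and the function name are assumptions made for illustration; the real schedule would come from LevelSetSchedule.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical supernode record (not the project's LevelSetSchedule type).
struct Node {
    std::vector<int> deps;  // supernodes that must complete first
};

// Emulates `numBlocks` resident blocks on one host thread.
// Block b owns supernodes b, b + numBlocks, b + 2*numBlocks, ... in that order.
// Returns the processing order, or an empty vector if no block can make
// progress (the emulated equivalent of a spin-wait deadlock).
std::vector<int> runPersistentEmulation(const std::vector<Node>& nodes,
                                        int numBlocks) {
    assert(numBlocks > 0);
    const int n = static_cast<int>(nodes.size());

    // done[s] becomes non-zero once supernode s has been processed.
    std::vector<std::atomic<int>> done(static_cast<std::size_t>(n));
    for (auto& d : done) d.store(0);

    std::vector<int> next(static_cast<std::size_t>(numBlocks));
    for (int b = 0; b < numBlocks; ++b) next[b] = b;

    std::vector<int> order;
    while (static_cast<int>(order.size()) < n) {
        bool progressed = false;
        for (int b = 0; b < numBlocks; ++b) {
            const int s = next[b];
            if (s >= n) continue;  // this block has finished its work list

            // Poll the counters; on the GPU this would be the spin-wait.
            bool ready = true;
            for (int d : nodes[s].deps) {
                if (done[d].load(std::memory_order_acquire) == 0) {
                    ready = false;
                    break;
                }
            }
            if (!ready) continue;

            order.push_back(s);  // stand-in for the per-supernode TRSV/GEMV
            done[s].fetch_add(1, std::memory_order_release);  // publish
            next[b] = s + numBlocks;
            progressed = true;
        }
        if (!progressed) return {};  // every remaining block is stuck waiting
    }
    return order;
}
```

If supernodes are numbered so that every dependency has a smaller index than its dependent, this ordering always makes progress. The empty-vector case corresponds to what a real persistent kernel would experience as a hang, which is why the forward-progress requirement for resident blocks matters.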

Key requirements

  • cuBLASDx — compile-time template library for device-side BLAS
  • NVIDIA forward-progress guarantee for resident thread blocks
  • Pre-computed dependency graph (already available via LevelSetSchedule)
  • Shared memory sizing for per-supernode TRSV/GEMV working storage
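For the shared memory requirement, a host-side pre-pass could compute the largest per-supernode working set and check it against a budget before launch. The sketch below is only an illustration: the `SupernodeShape` struct and the assumed layout (diagonal tile, right-hand-side segment, and a GEMV output segment held in shared memory) are hypothetical, and the actual footprint depends on how the kernel stages its data. The 48 KB figure used in the usage note is the per-block limit CUDA allows without opting in to a larger dynamic allocation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical shape of one dense supernode (lump).
struct SupernodeShape {
    int diag;   // order of the triangular diagonal block
    int below;  // number of rows updated by the off-diagonal GEMV
};

// Bytes needed under the assumed layout:
// diag x diag tile + diag-long rhs segment + below-long GEMV output.
std::size_t workingBytes(const SupernodeShape& s, std::size_t elemSize) {
    const std::size_t d = static_cast<std::size_t>(s.diag);
    const std::size_t b = static_cast<std::size_t>(s.below);
    return (d * d + d + b) * elemSize;
}

// Largest working set over all supernodes; one block must be able to hold it.
std::size_t maxWorkingBytes(const std::vector<SupernodeShape>& shapes,
                            std::size_t elemSize) {
    std::size_t worst = 0;
    for (const auto& s : shapes)
        worst = std::max(worst, workingBytes(s, elemSize));
    return worst;
}

// True if every supernode fits in the given per-block budget.
bool fitsInSharedMemory(const std::vector<SupernodeShape>& shapes,
                        std::size_t elemSize, std::size_t budgetBytes) {
    return maxWorkingBytes(shapes, elemSize) <= budgetBytes;
}
```

For example, with double precision and a 48 KB budget, a supernode with a 64x64 diagonal block fits under this layout, while an 80x80 block does not and would need tiling or an opt-in to a larger allocation.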

Expected improvement

  • Solve: 3.7ms → <1ms (a single kernel launch replaces the per-phase launches and the cuBLAS dispatch overhead)
  • Total LU: 6.5ms → ~4ms (approaching cuDSS's 3.1ms)

Notes

  • This is CUDA-only; Metal lacks forward-progress guarantees and device-side BLAS
  • The existing modular solve (Solver.cpp internalSolveLRangeUnit / internalSolveURange) should remain as fallback for non-CUDA backends
  • cuBLASDx requires specifying matrix sizes at compile time via templates — may need a few size specializations or runtime dispatch
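One way the "few size specializations or runtime dispatch" option might look is a small set of compile-time tile sizes with a runtime switch that picks the smallest one that fits. The sketch below is plain host C++ and does not use cuBLASDx: `trsvUnitLowerFixed` stands in for a device-side TRSV whose size is a template parameter, and the bucket sizes (8, 16, 32) and function names are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Forward substitution for a unit lower-triangular N x N tile (column-major).
// N is a compile-time constant, mirroring a size-templated device routine.
template <int N>
void trsvUnitLowerFixed(const double* L, double* x) {
    for (int j = 0; j < N; ++j)
        for (int i = j + 1; i < N; ++i)
            x[i] -= L[i + j * N] * x[j];
}

// Copies an n x n system (n <= N) into a zero-padded N x N tile and solves it.
// Padded rows have zero off-diagonals and a zero rhs, so they stay zero and
// do not affect the first n entries.
template <int N>
void solvePadded(const double* L, int n, int ld, double* x) {
    assert(n <= N && ld >= n);
    std::vector<double> tile(N * N, 0.0), rhs(N, 0.0);
    for (int j = 0; j < n; ++j)
        for (int i = j + 1; i < n; ++i)
            tile[i + j * N] = L[i + j * ld];  // strictly lower part only
    for (int i = 0; i < n; ++i) rhs[i] = x[i];
    trsvUnitLowerFixed<N>(tile.data(), rhs.data());
    for (int i = 0; i < n; ++i) x[i] = rhs[i];
}

// Runtime dispatch to the smallest compiled bucket that holds n.
// Returns false when n exceeds every bucket so the caller can fall back
// to a generic path.
bool solveUnitLowerDispatch(const double* L, int n, int ld, double* x) {
    if (n <= 8)  { solvePadded<8>(L, n, ld, x);  return true; }
    if (n <= 16) { solvePadded<16>(L, n, ld, x); return true; }
    if (n <= 32) { solvePadded<32>(L, n, ld, x); return true; }
    return false;
}
```

Padding trades some wasted arithmetic for fewer template instantiations. Whether a handful of buckets is enough depends on the size distribution of the dense lumps, which the issue does not state.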
