112 commits
72c6c28
Add Metal backend for Apple Silicon GPU acceleration
robtaylor Dec 23, 2025
5d0c03b
Add OpenCL backend for portable GPU acceleration
robtaylor Dec 25, 2025
6bd4dce
Add GitHub Actions CI with GPU testing infrastructure
robtaylor Dec 23, 2025
90530f0
Add cuDSS-style block CSR interface for solver creation
robtaylor Jan 2, 2026
5958fea
Add LU factorization support for single-block matrices
robtaylor Jan 2, 2026
e5683db
Add UMFPACK comparison tests for LU factorization validation
robtaylor Jan 2, 2026
49f264a
Fix multi-block LU factorization Schur complement updates
robtaylor Jan 3, 2026
9b6f91d
Add LDL^T factorization for symmetric indefinite matrices
robtaylor Jan 3, 2026
a06a4a0
Address code review feedback for LDL^T implementation
robtaylor Jan 3, 2026
508f01e
Relax OpenCL double precision tolerance for CI
robtaylor Jan 3, 2026
55948b7
Add LU factorization support to Metal backend and expand UMFPACK tests
robtaylor Feb 23, 2026
8ac0a35
Add LU factorization support to CUDA backend
robtaylor Feb 23, 2026
d9abcd9
Move LU operations to GPU for both CUDA and Metal backends
robtaylor Feb 23, 2026
2332dcd
Add GPU profiling instrumentation for Metal and CUDA LU operations
robtaylor Feb 23, 2026
fc79384
Replace Cloud Run GPU CI with self-hosted runners
robtaylor Feb 23, 2026
e4b9fe3
[cmake] Make llvm-ar optional, fall back to CMAKE_AR
robtaylor Feb 24, 2026
dabaf6b
[cmake] Improve llvm-ar error message with install instructions
robtaylor Feb 24, 2026
3f0d77c
Add GPU vs UMFPACK comparison tests and async dispatch infrastructure
robtaylor Feb 24, 2026
8ed0870
Fuse saveGemm dispatches into batched kernel for Metal LU
robtaylor Feb 24, 2026
8dd9e8b
Fix shared buffer race condition in batched saveGemm
robtaylor Feb 24, 2026
4a503fb
Add cuDSS vs BaSpaCho nsys profiling benchmark
robtaylor Feb 25, 2026
8104204
Fix runner labels: use nvidia-runner-1 without self-hosted
robtaylor Feb 25, 2026
df242d3
Add cmake and build deps to CUDA self-hosted runner
robtaylor Feb 25, 2026
b2cdc43
Fix sparse_elim_straight kernel, add missing Metal kernels & per-kern…
robtaylor Feb 25, 2026
730501b
Use NVIDIA CUDA container for GPU CI jobs
robtaylor Feb 25, 2026
debb3b4
Fix cuDSS profiling workflow: correct version, add nsys install
robtaylor Feb 25, 2026
155ec40
Fix nsight-systems install: use versioned package name
robtaylor Feb 25, 2026
3a27154
Switch CUDA containers to Ubuntu 24.04
robtaylor Feb 25, 2026
11f7dc8
Simplify cuDSS install: use apt package on Ubuntu 24.04
robtaylor Feb 25, 2026
2a20518
Fix unused parameter warning in CUDA trsmUpperRight
robtaylor Feb 25, 2026
ef425c5
Update CudssBenchmarkTest for cuDSS 0.7 API
robtaylor Feb 25, 2026
690a7f9
Fix nsys profile step: remove invalid cd in container
robtaylor Feb 25, 2026
f59fc6e
Fix FastSymbolic tests
robtaylor Feb 25, 2026
acbc8d9
Add fast symbolic analysis using CHOLMOD
robtaylor Feb 27, 2026
a09b15e
Add supernode merging and level-set scheduling
robtaylor Feb 27, 2026
709d5f3
Address code review findings for Milestones 1 & 2
robtaylor Feb 27, 2026
e6206c5
Integrate supernode merging and level-set scheduling into solver pipe…
robtaylor Feb 27, 2026
5425f30
Address code review findings for Milestone 3
robtaylor Feb 27, 2026
d8575d6
Fix division-by-zero UB in RandomScalarMatrix test
robtaylor Feb 27, 2026
a4b2238
Cache FetchContent dependencies in CI workflows
robtaylor Feb 27, 2026
ff06a00
Add CI performance regression detection for GPU backends
robtaylor Feb 27, 2026
6bbbe10
Add cuDSS benchmark comparison and CI performance summaries
robtaylor Feb 27, 2026
c2b751a
Replace Metal potrf/trsm CPU fallbacks with MPS GPU implementations
robtaylor Feb 27, 2026
4796175
Address code review: cache status buffer, add edge-case tests
robtaylor Feb 27, 2026
09b144e
Implement batched Metal factorization, solve, and benchmarks
robtaylor Feb 27, 2026
78bbda8
Address code review: fix solveLt test, add potrf status checks
robtaylor Feb 27, 2026
d8590c1
Reduce Metal synchronization overhead with deferred GPU sync
robtaylor Feb 28, 2026
0e4b27e
Add sequence LU factorization tests for circuit simulator workflow
robtaylor Feb 28, 2026
1fa1ee0
Add MTYPE_GENERAL support to createSolver for LU factorization
robtaylor Feb 28, 2026
32ca906
Add performance timing to sequence tests and commit ring_sequence data
robtaylor Feb 28, 2026
e406e6a
Add static pivoting for LU factorization (Milestone 4)
robtaylor Feb 28, 2026
62d4b3d
Add MC64 preprocessing for LU factorization
robtaylor Mar 1, 2026
4a0afbb
Add LU sparse elimination for Metal GPU (50x faster factorization)
robtaylor Mar 1, 2026
d4ff88a
Add LU sparse elimination solve for Metal GPU (240x faster solve)
robtaylor Mar 1, 2026
343afce
Add LU benchmark tool (lu_bench) with CI integration
robtaylor Mar 1, 2026
53eb267
Fix CUDA LU segfault: read diagonal values via NumericCtx
robtaylor Mar 1, 2026
2a11a74
Add cuDSS LU solver to lu_bench for baseline comparison
robtaylor Mar 1, 2026
fa18e91
Document session findings: Metal-CUDA parity analysis complete
robtaylor Mar 1, 2026
1d479ef
Fix cuDSS LU build: use CUDSS_MVIEW_FULL (compatible with cuDSS 0.4)
robtaylor Mar 1, 2026
43150f8
Port LU sparse elimination kernels to CUDA
robtaylor Mar 2, 2026
bab6b75
Fix lu_bench GPU mirror re-allocation in refinement loop
robtaylor Mar 2, 2026
5c82c1e
Batch Metal solve dispatches: 7.6x solve speedup
robtaylor Mar 2, 2026
7eda82c
Batch pivot upload in Metal solve: eliminate per-lump sync
robtaylor Mar 2, 2026
dabeed2
Optimize Metal Cholesky factor: three-tier potrf routing + batch comm…
robtaylor Mar 2, 2026
fb86c5c
GPU-resident pivots: eliminate per-lump commitAndWait in Metal LU factor
robtaylor Mar 2, 2026
4e7fcf2
Document Level-Set Concurrent Dispatch investigation (Step 6)
robtaylor Mar 2, 2026
bdd0f77
CPU BLAS fallback for Metal dense LU: 8.8x factor speedup
robtaylor Mar 2, 2026
b4f93c3
CPU fallback for Metal dense solve: eliminate per-lump GPU sync
robtaylor Mar 3, 2026
56d8d86
Add c6288 sequence test data, research docs, and Python bindings
robtaylor Mar 3, 2026
5b799e1
Fix Metal BLAS transpose constant + add Cholesky GRID profiling bench…
robtaylor Mar 3, 2026
f84c4a5
Document Cholesky GRID profiling findings in learnings
robtaylor Mar 3, 2026
5e5286e
Fix BLAS GEMM transpose in Cholesky saveSyrkGemm CPU fallback
robtaylor Mar 3, 2026
42ad0fa
CPU BLAS fallback for batched Metal Cholesky: 5x speedup on GRID
robtaylor Mar 3, 2026
8023322
GPU-resident pivots: eliminate per-lump H→D memcpy in CUDA LU factor
robtaylor Mar 3, 2026
1760470
Batch small GEMM dispatches in CUDA LU factor: custom kernel replaces…
robtaylor Mar 3, 2026
51bc2d6
Defer CUDA LU pivot readback and perturb count: eliminate per-lump sy…
robtaylor Mar 3, 2026
5e6fbcc
Async H→D transfers via pinned memory in CUDA LU dense loop
robtaylor Mar 3, 2026
b89995d
Add LU dense loop profiling (BASPACHO_PROFILE_LU env var)
robtaylor Mar 3, 2026
abded36
Fix CUDA batched GEMM: atomicAdd for correctness + thread utilization
robtaylor Mar 3, 2026
d8f703c
CUDA LU: CPU BLAS dense fallback + lazy readValue cache (237ms → 3ms)
robtaylor Mar 4, 2026
2778e25
Metal LU sparse elim: pre-computed work list (3 vs 16 buffer bindings)
robtaylor Mar 5, 2026
709c447
Metal: batch sparse elim dispatches (20 sync → 1 command buffer)
robtaylor Mar 5, 2026
e8a8bf7
Add OSSignposter instrumentation for LU factor phases
robtaylor Mar 5, 2026
cb8cc78
Add dual Metal command queues for pipelined GPU execution
robtaylor Mar 5, 2026
567974f
Metal: route sparse elim to async command queue for pipeline parallelism
robtaylor Mar 5, 2026
f2c5307
Metal: remove CPU BLAS fallbacks from factor + batched factor
robtaylor Mar 8, 2026
0bddeeb
Metal: remove CPU Eigen fallbacks from LU solve methods
robtaylor Mar 8, 2026
572a2ac
Metal: add 5 GPU kernels for Cholesky dense solve
robtaylor Mar 8, 2026
71c7212
CUDA: remove cpuBlasMode_ infrastructure
robtaylor Mar 8, 2026
a9b55c8
GPU maxAbsDiag reduction: replace CPU readValue loop with GPU kernel
robtaylor Mar 8, 2026
c0aca15
Metal: remove inter-phase CPU sync from solve paths
robtaylor Mar 8, 2026
7111507
Remove BackendRef from public API
robtaylor Mar 8, 2026
4502b9b
Update docs for pure GPU backend architecture
robtaylor Mar 8, 2026
7b5f2c9
feat: Graph compatibility preparations for LU factorization
robtaylor Mar 9, 2026
f196193
feat: Replace cuSOLVER getrf with graph-capture-compatible LU kernel
robtaylor Mar 9, 2026
0b6e9b4
feat: Include upper triangle device pointers in deviceAccessor()
robtaylor Mar 10, 2026
f12dacb
fix: Make bundled static library target GLOBAL for FetchContent visib…
robtaylor Mar 10, 2026
62a7afe
feat: Add persistent NumericCtx/SolveCtx, device-resident pivots, and…
robtaylor Mar 10, 2026
d396e50
feat: Add disableAllStats() to SymbolicCtx for production FFI paths
robtaylor Mar 11, 2026
176570c
Fix intra-lump upper triangle Schur complement update in LU factoriza…
robtaylor Mar 10, 2026
f22fafe
Add scipy cross-validation for ring oscillator test
robtaylor Mar 11, 2026
964b647
Add CUDA/Metal scipy cross-validation in CI
robtaylor Mar 11, 2026
d2a3ee3
Fix CUDA build and macOS pip for scipy cross-validation
robtaylor Mar 11, 2026
48bb1e8
fix: Use relative tolerance for Metal factor tests
robtaylor Mar 11, 2026
2673fd6
feat: Propagate CUDA stream to all kernel launches and async ops
robtaylor Mar 11, 2026
8218f67
feat: Deterministic LU sparse elimination on Metal (two-phase)
robtaylor Mar 11, 2026
ea59c8f
feat: Deterministic Cholesky sparse elimination on Metal (two-phase)
robtaylor Mar 11, 2026
42f9c4d
CUDA: two-phase deterministic sparse elimination
robtaylor Mar 11, 2026
c400cb0
Tighten MetalFactorTest tolerance now that sparse elim is deterministic
robtaylor Mar 11, 2026
dbbf79c
feat: Add persistent context support to Metal backend
robtaylor Mar 11, 2026
4033d77
feat: Add external encoder API for streamable Metal solve
robtaylor Mar 11, 2026
6f7bc14
Add diagnostic test to break down Metal factor error sources
robtaylor Mar 11, 2026
529 changes: 529 additions & 0 deletions .claude/learnings.md

Large diffs are not rendered by default.

234 changes: 234 additions & 0 deletions .claude/narrative.md

Large diffs are not rendered by default.

105 changes: 105 additions & 0 deletions .github/workflows/cudss-profile.yml
@@ -0,0 +1,105 @@
name: cuDSS vs BaSpaCho Profiling

on:
  workflow_dispatch:
    inputs:
      nsys_args:
        description: 'Extra nsys arguments (e.g. --trace=cuda,nvtx,cublas)'
        default: ''

env:
  BUILD_TYPE: Release

jobs:
  profile:
    runs-on: [nvidia-runner-1]
    container:
      image: nvidia/cuda:12.6.3-devel-ubuntu24.04
      options: --gpus all

    steps:
      - uses: actions/checkout@v4

      - name: Cache CMake FetchContent dependencies
        uses: actions/cache@v4
        with:
          path: build/_deps
          key: deps-${{ runner.os }}-${{ runner.arch }}-${{ hashFiles('CMakeLists.txt') }}
          restore-keys: |
            deps-${{ runner.os }}-${{ runner.arch }}-

      - name: Install build dependencies
        run: |
          apt-get update
          apt-get install -y cmake build-essential libopenblas-dev \
            nsight-systems-2025.5.2 cudss-cuda-12

      - name: System info
        run: |
          echo "=== GPU Info ==="
          nvidia-smi || echo "nvidia-smi not available"
          echo "=== CUDA Version ==="
          nvcc --version || echo "nvcc not available"
          echo "=== nsys Version ==="
          nsys --version || echo "nsys not available"
          echo "=== cuDSS ==="
          dpkg -l | grep cudss || echo "cuDSS packages not found"

      - name: Configure CMake
        run: |
          cmake -S . -B build \
            -DCMAKE_BUILD_TYPE=${{ env.BUILD_TYPE }} \
            -DBASPACHO_USE_CUBLAS=ON \
            -DBASPACHO_USE_METAL=OFF \
            -DBASPACHO_USE_OPENCL=OFF \
            -DBASPACHO_BUILD_TESTS=ON \
            -DBASPACHO_BUILD_EXAMPLES=ON

      - name: Build CudssBenchmarkTest
        run: cmake --build build --config ${{ env.BUILD_TYPE }} --target CudssBenchmarkTest -j"$(nproc)"

      - name: Run nsys profile
        run: |
          export BASPACHO_MTX_DIR=test_data/c6288_jacobian

          nsys profile \
            --trace=cuda,nvtx,osrt \
            --stats=true \
            --output /tmp/cudss_vs_baspacho \
            ${{ inputs.nsys_args }} \
            ./build/baspacho/tests/CudssBenchmarkTest

      - name: Generate stats summary
        run: |
          echo "## cuDSS vs BaSpaCho Profiling Results" >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"

          echo "### NVTX Range Summary" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          nsys stats --report nvtxsum /tmp/cudss_vs_baspacho.nsys-rep 2>&1 | head -60 >> "$GITHUB_STEP_SUMMARY" || true
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"

          echo "### CUDA Kernel Summary" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          nsys stats --report cudakernsum /tmp/cudss_vs_baspacho.nsys-rep 2>&1 | head -80 >> "$GITHUB_STEP_SUMMARY" || true
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"

          echo "### CUDA API Summary" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          nsys stats --report cudaapisum /tmp/cudss_vs_baspacho.nsys-rep 2>&1 | head -60 >> "$GITHUB_STEP_SUMMARY" || true
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"

          echo "### Memory Transfer Summary" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          nsys stats --report cudamemcpysum /tmp/cudss_vs_baspacho.nsys-rep 2>&1 | head -40 >> "$GITHUB_STEP_SUMMARY" || true
          echo '```' >> "$GITHUB_STEP_SUMMARY"

      - name: Upload nsys report
        uses: actions/upload-artifact@v4
        with:
          name: nsys-profile-cudss-vs-baspacho
          path: /tmp/cudss_vs_baspacho.nsys-rep
          retention-days: 30
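
The "Generate stats summary" step in cudss-profile.yml repeats the same heading, fence, report, fence pattern four times. As a minimal sketch of how that pattern could be factored out, the function below wraps one section. `append_section` is a hypothetical helper and is not part of this PR; it takes the report command as trailing arguments, so it can be tried without `nsys` installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): append one fenced report section to the
# step summary, keeping at most max_lines lines of the command's output.
append_section() {
  local title="$1" max_lines="$2"
  shift 2
  {
    echo "### ${title}"
    # \140 is the octal code for a backtick, so this prints a Markdown fence.
    printf '\140\140\140\n'
    # Run the remaining arguments as the report command. A failing command
    # must not abort the step, mirroring the `|| true` in the workflow.
    "$@" 2>&1 | head -n "${max_lines}" || true
    printf '\140\140\140\n'
    echo ""
  } >> "${GITHUB_STEP_SUMMARY:-/dev/stdout}"
}
```

In the workflow this might be called as `append_section "NVTX Range Summary" 60 nsys stats --report nvtxsum /tmp/cudss_vs_baspacho.nsys-rep`, once per report.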
172 changes: 172 additions & 0 deletions .github/workflows/macos-metal.yml
@@ -0,0 +1,172 @@
name: macOS Metal Performance

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      benchmark_iterations:
        description: 'Number of benchmark iterations per problem'
        required: false
        default: '3'
      problem_filter:
        description: 'Regex filter for problems (empty = all)'
        required: false
        default: ''

env:
  BUILD_TYPE: Release

jobs:
  build-and-test:
    runs-on: macos-latest-xlarge  # Bare-metal Apple Silicon with GPU access

    steps:
      - uses: actions/checkout@v4

      - name: Cache CMake FetchContent dependencies
        uses: actions/cache@v4
        with:
          path: build/_deps
          key: deps-${{ runner.os }}-${{ runner.arch }}-${{ hashFiles('CMakeLists.txt') }}
          restore-keys: |
            deps-${{ runner.os }}-${{ runner.arch }}-

      - name: Print system info
        run: |
          echo "=== System Info ==="
          uname -a
          sysctl -n machdep.cpu.brand_string || echo "CPU info not available"
          system_profiler SPHardwareDataType | grep -E "Chip|Memory|Cores"

      - name: Install dependencies
        run: |
          brew install openblas llvm
          echo "OpenBLAS installed at: $(brew --prefix openblas)"
          echo "LLVM installed at: $(brew --prefix llvm)"

      - name: Configure CMake
        run: |
          export PATH="$(brew --prefix llvm)/bin:$PATH"
          export LDFLAGS="-L$(brew --prefix openblas)/lib"
          export CPPFLAGS="-I$(brew --prefix openblas)/include"
          cmake -S . -B build \
            -DCMAKE_BUILD_TYPE=${{ env.BUILD_TYPE }} \
            -DBASPACHO_USE_CUBLAS=OFF \
            -DBASPACHO_USE_METAL=ON \
            -DBASPACHO_BUILD_TESTS=ON \
            -DBASPACHO_BUILD_EXAMPLES=ON \
            -DBLA_VENDOR=OpenBLAS \
            -DCMAKE_PREFIX_PATH="$(brew --prefix openblas)"

      - name: Build
        run: |
          export PATH="$(brew --prefix llvm)/bin:$PATH"
          cmake --build build --config "${{ env.BUILD_TYPE }}" -j"$(sysctl -n hw.ncpu)"

      - name: Run all tests (CPU + Metal GPU)
        run: |
          ctest --test-dir build --output-on-failure --parallel "$(sysctl -n hw.ncpu)"

  benchmark:
    runs-on: macos-latest-xlarge  # Bare-metal Apple Silicon with GPU access
    needs: build-and-test

    steps:
      - uses: actions/checkout@v4

      - name: Cache CMake FetchContent dependencies
        uses: actions/cache@v4
        with:
          path: build/_deps
          key: deps-${{ runner.os }}-${{ runner.arch }}-${{ hashFiles('CMakeLists.txt') }}
          restore-keys: |
            deps-${{ runner.os }}-${{ runner.arch }}-

      - name: Print system info
        run: |
          echo "=== System Info ==="
          uname -a
          sysctl -n machdep.cpu.brand_string || echo "CPU info not available"
          system_profiler SPHardwareDataType | grep -E "Chip|Memory|Cores"

      - name: Install dependencies
        run: |
          brew install openblas llvm
          echo "OpenBLAS installed at: $(brew --prefix openblas)"
          echo "LLVM installed at: $(brew --prefix llvm)"

      - name: Configure CMake
        run: |
          export PATH="$(brew --prefix llvm)/bin:$PATH"
          export LDFLAGS="-L$(brew --prefix openblas)/lib"
          export CPPFLAGS="-I$(brew --prefix openblas)/include"
          cmake -S . -B build \
            -DCMAKE_BUILD_TYPE=${{ env.BUILD_TYPE }} \
            -DBASPACHO_USE_CUBLAS=OFF \
            -DBASPACHO_USE_METAL=ON \
            -DBASPACHO_BUILD_TESTS=OFF \
            -DBASPACHO_BUILD_EXAMPLES=ON \
            -DBLA_VENDOR=OpenBLAS \
            -DCMAKE_PREFIX_PATH="$(brew --prefix openblas)"

      - name: Build
        run: |
          export PATH="$(brew --prefix llvm)/bin:$PATH"
          cmake --build build --config "${{ env.BUILD_TYPE }}" -j"$(sysctl -n hw.ncpu)"

      - name: Run Performance Benchmarks
        run: |
          mkdir -p benchmark_results
          cd benchmark_results

          # Set benchmark parameters
          ITERATIONS="${{ github.event.inputs.benchmark_iterations || '3' }}"
          PROBLEM_FILTER="${{ github.event.inputs.problem_filter || '' }}"

          echo "=== Running BaSpaCho Benchmarks ===" | tee benchmark_output.txt
          echo "Iterations per problem: $ITERATIONS" | tee -a benchmark_output.txt
          echo "Date: $(date)" | tee -a benchmark_output.txt
          echo "" | tee -a benchmark_output.txt

          # Run CPU baseline benchmark
          echo "=== CPU Baseline (BLAS) ===" | tee -a benchmark_output.txt
          if [ -n "$PROBLEM_FILTER" ]; then
            ../build/baspacho/benchmarking/bench -S "BLAS" -R "$PROBLEM_FILTER" -n $ITERATIONS 2>&1 | tee -a benchmark_output.txt
          else
            ../build/baspacho/benchmarking/bench -S "BLAS" -n $ITERATIONS 2>&1 | tee -a benchmark_output.txt
          fi

          # Run Metal GPU benchmark
          echo "" | tee -a benchmark_output.txt
          echo "=== Metal GPU ===" | tee -a benchmark_output.txt
          if [ -n "$PROBLEM_FILTER" ]; then
            ../build/baspacho/benchmarking/bench -S "Metal" -R "$PROBLEM_FILTER" -n $ITERATIONS 2>&1 | tee -a benchmark_output.txt || echo "Metal benchmark failed" | tee -a benchmark_output.txt
          else
            ../build/baspacho/benchmarking/bench -S "Metal" -n $ITERATIONS 2>&1 | tee -a benchmark_output.txt || echo "Metal benchmark failed" | tee -a benchmark_output.txt
          fi

      - name: Upload Benchmark Results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results-${{ github.sha }}
          path: benchmark_results/
          retention-days: 30

      - name: Post Benchmark Summary
        run: |
          {
            echo "## Benchmark Results"
            echo ""
            echo "### System Info"
            echo "- Runner: macos-latest-xlarge (Apple Silicon)"
            echo "- Chip: $(system_profiler SPHardwareDataType | grep 'Chip' | cut -d':' -f2 | xargs)"
            echo "- Memory: $(system_profiler SPHardwareDataType | grep 'Memory' | cut -d':' -f2 | xargs)"
            echo ""
            echo "### Results"
            echo "\`\`\`"
            cat benchmark_results/benchmark_output.txt
            echo "\`\`\`"
          } >> "$GITHUB_STEP_SUMMARY"
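
The "Run Performance Benchmarks" step in macos-metal.yml duplicates each `bench` invocation in an `if`/`else` that differs only by the optional `-R` filter. A minimal sketch of building the argument list once is shown below. `run_bench` is a hypothetical helper that is not part of this PR; the `-S`, `-R` and `-n` flags are the ones the workflow already passes, and the benchmark executable is a parameter so the function can be exercised with a stand-in command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): run the benchmark binary for one
# backend, adding -R only when a problem filter was supplied.
run_bench() {
  local bench_bin="$1" backend="$2" iterations="$3" filter="$4"
  # Rebuild the positional parameters as the argument list for the binary.
  set -- -S "$backend"
  if [ -n "$filter" ]; then
    set -- "$@" -R "$filter"
  fi
  set -- "$@" -n "$iterations"
  "$bench_bin" "$@"
}
```

With this, each backend in the step might reduce to a single line such as `run_bench ../build/baspacho/benchmarking/bench "Metal" "$ITERATIONS" "$PROBLEM_FILTER" 2>&1 | tee -a benchmark_output.txt`. Quoting the iteration count also avoids word splitting if the input is ever empty or malformed.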