179 changes: 179 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,179 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

BaSpaCho (Batched Sparse Cholesky) is a high-performance direct solver for symmetric positive-definite sparse matrices. It implements supernodal Cholesky decomposition with GPU support via CUDA and Metal; the CUDA backend additionally supports batched solving.

## Build Commands

**Configure (CPU-only, using OpenBLAS):**
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0
```

**Configure (with CUDA):**
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
```

**Configure (with Intel MKL):**
```bash
. /opt/intel/oneapi/setvars.sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBLA_VENDOR=Intel10_64lp
```

**Configure (with Apple Metal, macOS only):**
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0 -DBASPACHO_USE_METAL=1 -DBLA_VENDOR=Apple
```

**Build:**
```bash
cmake --build build -- -j16
```

**Run all tests:**
```bash
ctest --test-dir build
```

**List available tests:**
```bash
ctest --test-dir build --show-only
```

**Run a single test:**
```bash
ctest --test-dir build -R <test_name>
```

**Using pixi (alternative):**
```bash
pixi run prepare # Configure without CUDA
pixi run build # Build
pixi run test # Run tests
pixi run build_and_test # Full workflow
```

## Code Style

- C++17 standard
- Google style base with modifications (see `.clang-format`)
- Column limit: 100 characters
- Pointer alignment: left (`int* ptr` not `int *ptr`)
- Format with: `clang-format -i <file>`
- Pre-commit hook runs clang-format automatically

## Architecture

### Core Data Structures

**SparseStructure** (`baspacho/baspacho/SparseStructure.h`): CSR-format sparse structure storing `ptrs` and `inds` vectors representing block indices (not individual elements).

**CoalescedBlockMatrixSkel** (`baspacho/baspacho/CoalescedBlockMatrix.h`): Block matrix skeleton with coalesced columns. Key terminology:
- **span**: basic parameter block grouping
- **lump**: aggregation of consecutive spans
- **chain**: span rows × lump cols
- **board**: all spans in a lump of rows × lump cols

**Solver** (`baspacho/baspacho/Solver.h`): Main interface created via `createSolver()`. Provides:
- `factor()`: Cholesky factorization
- `solve()`, `solveL()`, `solveLt()`: triangular solves
- `factorUpTo()`, `solveLUpTo()`: partial factorization for marginals
- Backends: `BackendRef`, `BackendFast`, `BackendCuda`, `BackendMetal`, `BackendOpenCL`
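A minimal CPU usage sketch in the same style as the GPU snippets elsewhere in this file; it assumes `paramSize`, `structure`, `matData`, and `rhs` are already built, and the exact `solve()` signature is an assumption:

```cpp
// Hedged sketch (signatures may differ from the real API)
Settings settings;
settings.backend = BackendFast;  // multithreaded CPU backend using BLAS
auto solver = createSolver<double>(paramSize, structure, settings);
solver.factor(matData.data());             // in-place Cholesky factorization
solver.solve(matData.data(), rhs.data());  // assumed: overwrites rhs with the solution
```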

### Directory Structure

```
baspacho/
baspacho/ # Core library sources
testing/ # Test utilities (TestingMatGen, TestingUtils)
tests/ # Unit tests (gtest)
benchmarking/ # Performance benchmarks (bench, BAL_bench)
examples/ # Example applications (Optimizer, PCG)
```

### Key CMake Options

- `BASPACHO_USE_CUBLAS`: Enable CUDA support (default: ON)
- `BASPACHO_USE_METAL`: Enable Apple Metal support (default: OFF, macOS only, float only)
- `BASPACHO_USE_OPENCL`: Enable OpenCL support with CLBlast (default: OFF, experimental)
- `BASPACHO_USE_BLAS`: Enable BLAS support (default: ON)
- `BASPACHO_CUDA_ARCHS`: CUDA architectures ("detect", "torch", or explicit list like "60;70;75")
- `BASPACHO_USE_SUITESPARSE_AMD`: Use SuiteSparse AMD instead of Eigen's implementation
- `BASPACHO_BUILD_TESTS`: Build tests (default: ON)
- `BASPACHO_BUILD_EXAMPLES`: Build examples/benchmarks (default: ON)
- `BLA_VENDOR`: BLAS implementation (ATLAS, OpenBLAS, Intel10_64lp_seq, Apple, etc.)

## GPU Backend Notes

### Metal Backend (Apple Silicon)

The Metal backend provides GPU acceleration on Apple Silicon Macs (M1, M2, M3, etc.).

**Important: Float-only precision.** Apple Silicon GPUs lack native double-precision (FP64) support, so the Metal backend only supports `float` operations. Attempting to use `double` results in a runtime error.

```cpp
// Metal backend usage (float only)
Settings settings;
settings.backend = BackendMetal;
auto solver = createSolver<float>(paramSize, structure, settings);

// Use MetalMirror for GPU memory management
MetalMirror<float> dataGpu(hostData);
solver.factor(dataGpu.ptr());
dataGpu.get(hostData); // Copy back to CPU
```

For double precision, use `BackendFast` (CPU with BLAS) or `BackendCuda` (NVIDIA GPU).

### CUDA Backend (NVIDIA)

The CUDA backend supports both float and double precision on NVIDIA GPUs with compute capability >= 6.0.
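A configuration sketch analogous to the Metal and OpenCL snippets in this file (assuming the same `paramSize`/`structure` inputs are already built):

```cpp
// CUDA backend usage (float or double)
Settings settings;
settings.backend = BackendCuda;
auto solver = createSolver<double>(paramSize, structure, settings);
```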

### OpenCL Backend (Experimental)

The OpenCL backend provides portable GPU acceleration using CLBlast for BLAS operations.

**Status:** Experimental. Currently uses CPU fallbacks for most operations. The infrastructure is in place but full GPU kernel execution is not yet implemented.

**Requirements:**
- OpenCL 1.2+ runtime
- CLBlast library

```cpp
// OpenCL backend usage
Settings settings;
settings.backend = BackendOpenCL;
auto solver = createSolver<float>(paramSize, structure, settings);
```

For production use, prefer CUDA (NVIDIA) or Metal (Apple Silicon) backends.

## Dependencies

Fetched automatically by CMake:
- Eigen 3.4.0
- GoogleTest
- dispenso (multithreading)
- Sophus (for BA examples only)

Optional external:
- CUDA Toolkit (10.2+, architecture >=60 for double atomics)
- CHOLMOD (SuiteSparse) - for benchmarking comparisons
- AMD (SuiteSparse) - alternative reordering algorithm

## Running Benchmarks

```bash
# Compare with CHOLMOD baseline
build/baspacho/benchmarking/bench -B 1_CHOLMOD

# Bundle Adjustment problem
build/baspacho/benchmarking/BAL_bench -i ~/BAL/problem-871-527480-pre.txt

# Collect timing statistics for computation model fitting
build/baspacho/benchmarking/bench -B 1_CHOLMOD -Z
```
50 changes: 50 additions & 0 deletions CMakeLists.txt
@@ -15,6 +15,9 @@ set(BASPACHO_CXX_FLAGS -Wall -Wextra -pedantic) # -O1 -g -fsanitize=address -sta
message("* Build type: " ${CMAKE_BUILD_TYPE})
include("${CMAKE_CURRENT_SOURCE_DIR}/cmake/BundleStaticLibrary.cmake")

# Add cmake module path for custom find modules
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")

# general settings
set(BASPACHO_BUILD_TESTS ON CACHE BOOL "If on (default), tests are built")
set(BASPACHO_BUILD_EXAMPLES ON CACHE BOOL "If on (default), examples/benchmarks are built")
@@ -110,6 +113,53 @@ if(BASPACHO_USE_CUBLAS)
add_compile_definitions(BASPACHO_USE_CUBLAS)
endif()

# METAL (macOS/iOS GPU backend)
set(BASPACHO_USE_METAL OFF CACHE BOOL "If on, Metal support is enabled (macOS/iOS only)")

if(BASPACHO_USE_METAL)
message("${Cyan}==============================[ Metal ]==================================${ColourReset}")

if(NOT APPLE)
message(FATAL_ERROR "Metal backend is only supported on macOS/iOS")
endif()

# Find Metal frameworks
find_library(METAL_FRAMEWORK Metal REQUIRED)
find_library(MPS_FRAMEWORK MetalPerformanceShaders REQUIRED)
find_library(FOUNDATION_FRAMEWORK Foundation REQUIRED)

message("* Metal Framework: ${METAL_FRAMEWORK}")
message("* MPS Framework: ${MPS_FRAMEWORK}")
message("* Foundation Framework: ${FOUNDATION_FRAMEWORK}")

# Note: We do NOT call enable_language(OBJCXX) here because it causes
# cmake to treat ALL .m files as Objective-C++, which breaks projects
# that embed BaSpaCho and have their own Objective-C code (like IREE).
# Instead, we explicitly set compile flags for .mm files in the
# baspacho/baspacho/CMakeLists.txt.

add_compile_definitions(BASPACHO_USE_METAL)
endif()

# OpenCL (with CLBlast for BLAS operations)
set(BASPACHO_USE_OPENCL OFF CACHE BOOL "If on, OpenCL support is enabled (with CLBlast)")

if(BASPACHO_USE_OPENCL)
message("${Cyan}==============================[ OpenCL ]=================================${ColourReset}")

find_package(OpenCL REQUIRED)
message("* OpenCL Version: ${OpenCL_VERSION_STRING}")
message("* OpenCL Include: ${OpenCL_INCLUDE_DIRS}")
message("* OpenCL Libraries: ${OpenCL_LIBRARIES}")

# CLBlast provides high-performance BLAS for OpenCL
find_package(CLBlast REQUIRED)
message("* CLBlast Libraries: ${CLBLAST_LIBRARIES}")

include_directories(${OpenCL_INCLUDE_DIRS})
add_compile_definitions(BASPACHO_USE_OPENCL)
endif()

# BLAS. a few possibilities are:
# * ATLAS
# * OpenBLAS
69 changes: 58 additions & 11 deletions README.md
@@ -26,8 +26,8 @@ libraries. It is designed with optimization libraries for Levenberg-Marquardt in
at reducing part of the complexity offering the best tool for the job.
Compared to the library currently considered state of the art (CHOLMOD from SuiteSparse) it
supports:
* **GPU acceleration with CUDA and Metal,** supporting NVIDIA GPUs (CUDA) and Apple Silicon
(Metal). CUDA supports batching for differentiable optimization in Theseus library.
* **parallel elimination of independent sparse small elimination nodes,** essentially
the operation done via "Schur-elimination trick" in mathematical optimization libraries such as Ceres.
This is a workaround to the supernodal algorithm being a bad fit for the problem structure, so the
@@ -50,7 +50,9 @@ Libraries fetched automatically by build:
* Sophus (only used in BA demo)

Optional libraries:
* CUDA toolkit (tested with CUDA 10.2/11.7), for NVIDIA GPU support. Disable with `-DBASPACHO_USE_CUBLAS=0`.
* Metal (macOS only), for Apple Silicon GPU support. Enable with `-DBASPACHO_USE_METAL=1`.
* OpenCL + CLBlast, for portable GPU support. Enable with `-DBASPACHO_USE_OPENCL=1`. (Experimental)
* AMD, from SuiteSparse, can be used instead of Eigen for block reordering algorithm.
* CHOLMOD, from SuiteSparse, used in benchmark as a reference for performance of sparse solvers.

@@ -82,16 +84,59 @@ Show tests:
ctest --test-dir build --show-only
```

### CUDA
CUDA is enabled via the BASPACHO_USE_CUBLAS option (on by default); add
`-DBASPACHO_USE_CUBLAS=0` to disable it.
You may need to add `-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc` so the
build can find the CUDA compiler.
The CUDA architectures can be specified with e.g. `-DBASPACHO_CUDA_ARCHS="60;70;75"`,
which also supports the options 'detect' (default) which detects the installed GPU arch,
and 'torch' which fills in the architectures supported by PyTorch and >=60 (see below).

### Metal (Apple Silicon)
Metal support is available for macOS with Apple Silicon (M1/M2/M3/M4). To enable:
```
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0 -DBASPACHO_USE_METAL=1 -DBLA_VENDOR=Apple
```

**Important: The Metal backend only supports single-precision (float) operations.**
Apple Silicon GPUs lack native double-precision FP64 support. Attempting to use
double precision with the Metal backend will result in a runtime error. For double
precision, use the CPU backend (`BackendFast`) or CUDA (`BackendCuda`).

The Metal backend uses:
- Custom Metal compute shaders for sparse operations (factor_lumps, sparse_elim, assemble)
- Metal Performance Shaders (MPS) for dense matrix multiply on large matrices
- Eigen/Accelerate for Cholesky factorization (potrf) and triangular solve (trsm)

### Backend Selection
BaSpaCho supports automatic backend selection with `BackendAuto`:
```cpp
Settings settings;
settings.backend = BackendAuto; // Auto-detect best backend
auto solver = createSolver<float>(paramSize, structure, settings);
```

The detection priority is: CUDA > Metal > OpenCL > CPU (BLAS).

You can also use `detectBestBackend()` to query the recommended backend at runtime.
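For example, a hedged sketch of combining the two (assuming `detectBestBackend()` returns a value assignable to `settings.backend`):

```cpp
Settings settings;
settings.backend = detectBestBackend();  // pick the best available backend
auto solver = createSolver<float>(paramSize, structure, settings);
```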

### OpenCL (Experimental)
OpenCL support provides a portable GPU backend using CLBlast for BLAS operations.
To enable:
```
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0 -DBASPACHO_USE_OPENCL=1
```

**Requirements:**
- OpenCL 1.2+ runtime and development headers
- [CLBlast](https://github.com/CNugteren/CLBlast) library

**Note:** The OpenCL backend is experimental. It provides infrastructure for portable GPU
acceleration but currently uses CPU fallbacks for most operations. For production use,
prefer CUDA (NVIDIA) or Metal (Apple Silicon) backends.

### BLAS
The library used is specified in the CMake variable BLA_VENDOR,
a few possibilities are:
* ATLAS
@@ -187,11 +232,13 @@ because matrices naturally have a block-structure depending on parameters of dim
cuda-kernel operations are designed around blocks so it's not ideal if you have some huge parameter blocks,
the library will work best when the parameter blocks have sizes 1 to 12 (in a factor graph, generally you
have many parameter blocks of the same type).
* **CUDA determinism**: assuming BLAS is deterministic, BaSpaCho will be 100% deterministic on the CPU, but
not on CUDA if there is any "sparse elimination" set of parameters, because both factor and solve operations
use atomic addition for parallelism on the GPU. Also, a CUDA architecture >=6 is needed for atomicAdd
on double numbers (this is the compute hardware architecture and not the version of CUDA, arch >=6 means
you need Tesla P100 or GTX 1080-family, or newer. See
[CUDA Architectures](https://en.wikipedia.org/wiki/CUDA#GPUs_supported)).
Otherwise you will have to add the define `CUDA_DOUBLE_ATOMIC_ADD_WORKAROUND` to enable the workaround
in `CudaAtomic.cuh`.
* **Metal precision**: The Metal backend only supports single-precision (float) due to Apple Silicon's
limited FP64 support. Use `BackendFast` (CPU) or `BackendCuda` for double precision requirements.
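One possible way to pass the `CUDA_DOUBLE_ATOMIC_ADD_WORKAROUND` define mentioned above at configure time (a hypothetical invocation; adjust to your setup):

```
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CUDA_FLAGS="-DCUDA_DOUBLE_ATOMIC_ADD_WORKAROUND"
```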