179 changes: 179 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,179 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

BaSpaCho (Batched Sparse Cholesky) is a high-performance direct solver for symmetric positive-definite sparse matrices. It implements supernodal Cholesky decomposition with GPU support via CUDA and Metal; the CUDA backend additionally supports batched solving.

## Build Commands

**Configure (CPU-only, using OpenBLAS):**
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0
```

**Configure (with CUDA):**
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
```

**Configure (with Intel MKL):**
```bash
. /opt/intel/oneapi/setvars.sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBLA_VENDOR=Intel10_64lp
```

**Configure (with Apple Metal, macOS only):**
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0 -DBASPACHO_USE_METAL=1 -DBLA_VENDOR=Apple
```

**Build:**
```bash
cmake --build build -- -j16
```

**Run all tests:**
```bash
ctest --test-dir build
```

**List available tests:**
```bash
ctest --test-dir build --show-only
```

**Run a single test:**
```bash
ctest --test-dir build -R <test_name>
```

**Using pixi (alternative):**
```bash
pixi run prepare # Configure without CUDA
pixi run build # Build
pixi run test # Run tests
pixi run build_and_test # Full workflow
```

## Code Style

- C++17 standard
- Google style base with modifications (see `.clang-format`)
- Column limit: 100 characters
- Pointer alignment: left (`int* ptr` not `int *ptr`)
- Format with: `clang-format -i <file>`
- Pre-commit hook runs clang-format automatically

## Architecture

### Core Data Structures

**SparseStructure** (`baspacho/baspacho/SparseStructure.h`): CSR-format sparse structure storing `ptrs` and `inds` vectors representing block indices (not individual elements).

**CoalescedBlockMatrixSkel** (`baspacho/baspacho/CoalescedBlockMatrix.h`): Block matrix skeleton with coalesced columns. Key terminology:
- **span**: basic parameter block grouping
- **lump**: aggregation of consecutive spans
- **chain**: span rows × lump cols
- **board**: all spans in a lump of rows × lump cols

**Solver** (`baspacho/baspacho/Solver.h`): Main interface created via `createSolver()`. Provides:
- `factor()`: Cholesky factorization
- `solve()`, `solveL()`, `solveLt()`: triangular solves
- `factorUpTo()`, `solveLUpTo()`: partial factorization for marginals
- Backends: `BackendRef`, `BackendFast`, `BackendCuda`, `BackendMetal`, `BackendOpenCL`
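A minimal CPU usage sketch in the same style as the GPU snippets elsewhere in this file; it assumes `paramSize`, `structure`, `matData`, and `rhs` are already built, and the exact `solve()` signature is an assumption:

```cpp
// Hedged sketch (signatures may differ from the real API)
Settings settings;
settings.backend = BackendFast;  // multithreaded CPU backend using BLAS
auto solver = createSolver<double>(paramSize, structure, settings);
solver.factor(matData.data());             // in-place Cholesky factorization
solver.solve(matData.data(), rhs.data());  // assumed: overwrites rhs with the solution
```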

### Directory Structure

```
baspacho/
baspacho/ # Core library sources
testing/ # Test utilities (TestingMatGen, TestingUtils)
tests/ # Unit tests (gtest)
benchmarking/ # Performance benchmarks (bench, BAL_bench)
examples/ # Example applications (Optimizer, PCG)
```

### Key CMake Options

- `BASPACHO_USE_CUBLAS`: Enable CUDA support (default: ON)
- `BASPACHO_USE_METAL`: Enable Apple Metal support (default: OFF, macOS only, float only)
- `BASPACHO_USE_OPENCL`: Enable OpenCL support with CLBlast (default: OFF, experimental)
- `BASPACHO_USE_BLAS`: Enable BLAS support (default: ON)
- `BASPACHO_CUDA_ARCHS`: CUDA architectures ("detect", "torch", or explicit list like "60;70;75")
- `BASPACHO_USE_SUITESPARSE_AMD`: Use SuiteSparse AMD instead of Eigen's implementation
- `BASPACHO_BUILD_TESTS`: Build tests (default: ON)
- `BASPACHO_BUILD_EXAMPLES`: Build examples/benchmarks (default: ON)
- `BLA_VENDOR`: BLAS implementation (ATLAS, OpenBLAS, Intel10_64lp_seq, Apple, etc.)

## GPU Backend Notes

### Metal Backend (Apple Silicon)

The Metal backend provides GPU acceleration on Apple Silicon Macs (M1, M2, M3, etc.).

**Important: Float-only precision.** Apple Silicon GPUs lack native double-precision (FP64) support, so the Metal backend only supports `float` operations. Attempting to use `double` results in a runtime error.

```cpp
// Metal backend usage (float only)
Settings settings;
settings.backend = BackendMetal;
auto solver = createSolver<float>(paramSize, structure, settings);

// Use MetalMirror for GPU memory management
MetalMirror<float> dataGpu(hostData);
solver.factor(dataGpu.ptr());
dataGpu.get(hostData); // Copy back to CPU
```

For double precision, use `BackendFast` (CPU with BLAS) or `BackendCuda` (NVIDIA GPU).

### CUDA Backend (NVIDIA)

The CUDA backend supports both float and double precision on NVIDIA GPUs with compute capability >= 6.0.
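A configuration sketch analogous to the Metal and OpenCL snippets in this file (assuming the same `paramSize`/`structure` inputs are already built):

```cpp
// CUDA backend usage (float or double)
Settings settings;
settings.backend = BackendCuda;
auto solver = createSolver<double>(paramSize, structure, settings);
```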

### OpenCL Backend (Experimental)

The OpenCL backend provides portable GPU acceleration using CLBlast for BLAS operations.

**Status:** Experimental. Currently uses CPU fallbacks for most operations. The infrastructure is in place but full GPU kernel execution is not yet implemented.

**Requirements:**
- OpenCL 1.2+ runtime
- CLBlast library

```cpp
// OpenCL backend usage
Settings settings;
settings.backend = BackendOpenCL;
auto solver = createSolver<float>(paramSize, structure, settings);
```

For production use, prefer CUDA (NVIDIA) or Metal (Apple Silicon) backends.

## Dependencies

Fetched automatically by CMake:
- Eigen 3.4.0
- GoogleTest
- dispenso (multithreading)
- Sophus (for BA examples only)

Optional external:
- CUDA Toolkit (10.2+, architecture >=60 for double atomics)
- CHOLMOD (SuiteSparse) - for benchmarking comparisons
- AMD (SuiteSparse) - alternative reordering algorithm

## Running Benchmarks

```bash
# Compare with CHOLMOD baseline
build/baspacho/benchmarking/bench -B 1_CHOLMOD

# Bundle Adjustment problem
build/baspacho/benchmarking/BAL_bench -i ~/BAL/problem-871-527480-pre.txt

# Collect timing statistics for computation model fitting
build/baspacho/benchmarking/bench -B 1_CHOLMOD -Z
```
50 changes: 50 additions & 0 deletions CMakeLists.txt
@@ -15,6 +15,9 @@ set(BASPACHO_CXX_FLAGS -Wall -Wextra -pedantic) # -O1 -g -fsanitize=address -sta
message("* Build type: " ${CMAKE_BUILD_TYPE})
include("${CMAKE_CURRENT_SOURCE_DIR}/cmake/BundleStaticLibrary.cmake")

# Add cmake module path for custom find modules
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")

# general settings
set(BASPACHO_BUILD_TESTS ON CACHE BOOL "If on (default), tests are built")
set(BASPACHO_BUILD_EXAMPLES ON CACHE BOOL "If on (default), examples/benchmarks are built")
@@ -110,6 +113,53 @@ if(BASPACHO_USE_CUBLAS)
add_compile_definitions(BASPACHO_USE_CUBLAS)
endif()

# METAL (macOS/iOS GPU backend)
set(BASPACHO_USE_METAL OFF CACHE BOOL "If on, Metal support is enabled (macOS/iOS only)")

if(BASPACHO_USE_METAL)
message("${Cyan}==============================[ Metal ]==================================${ColourReset}")

if(NOT APPLE)
message(FATAL_ERROR "Metal backend is only supported on macOS/iOS")
endif()

# Find Metal frameworks
find_library(METAL_FRAMEWORK Metal REQUIRED)
find_library(MPS_FRAMEWORK MetalPerformanceShaders REQUIRED)
find_library(FOUNDATION_FRAMEWORK Foundation REQUIRED)

message("* Metal Framework: ${METAL_FRAMEWORK}")
message("* MPS Framework: ${MPS_FRAMEWORK}")
message("* Foundation Framework: ${FOUNDATION_FRAMEWORK}")

# Note: We do NOT call enable_language(OBJCXX) here because it causes
# cmake to treat ALL .m files as Objective-C++, which breaks projects
# that embed BaSpaCho and have their own Objective-C code (like IREE).
# Instead, we explicitly set compile flags for .mm files in the
# baspacho/baspacho/CMakeLists.txt.

add_compile_definitions(BASPACHO_USE_METAL)
endif()

# OpenCL (with CLBlast for BLAS operations)
set(BASPACHO_USE_OPENCL OFF CACHE BOOL "If on, OpenCL support is enabled (with CLBlast)")

if(BASPACHO_USE_OPENCL)
message("${Cyan}==============================[ OpenCL ]=================================${ColourReset}")

find_package(OpenCL REQUIRED)
message("* OpenCL Version: ${OpenCL_VERSION_STRING}")
message("* OpenCL Include: ${OpenCL_INCLUDE_DIRS}")
message("* OpenCL Libraries: ${OpenCL_LIBRARIES}")

# CLBlast provides high-performance BLAS for OpenCL
find_package(CLBlast REQUIRED)
message("* CLBlast Libraries: ${CLBLAST_LIBRARIES}")

include_directories(${OpenCL_INCLUDE_DIRS})
add_compile_definitions(BASPACHO_USE_OPENCL)
endif()

# BLAS. a few possibilities are:
# * ATLAS
# * OpenBLAS
69 changes: 58 additions & 11 deletions README.md
@@ -26,8 +26,8 @@ libraries. It is designed with optimization libraries for Levenberg-Marquardt in
at reducing part of the complexity offering the best tool for the job.
Compared to the library currently considered state of the art (CHOLMOD from SuiteSparse) it
supports:
* **GPU acceleration with CUDA and Metal,** supporting NVIDIA GPUs (CUDA) and Apple Silicon
(Metal). CUDA supports batching for differentiable optimization in Theseus library.
* **parallel elimination of independent sparse small elimination nodes,** essentially
the operation done via "Schur-elimination trick" in mathematical optimization libraries such as Ceres.
This is a workaround to the supernodal algorithm being a bad fit for the problem structure, so the
@@ -50,7 +50,9 @@ Libraries fetched automatically by build:
* Sophus (only used in BA demo)

Optional libraries:
* CUDA toolkit (tested with CUDA 10.2/11.7), for NVIDIA GPU support. Disable with `-DBASPACHO_USE_CUBLAS=0`.
* Metal (macOS only), for Apple Silicon GPU support. Enable with `-DBASPACHO_USE_METAL=1`.
* OpenCL + CLBlast, for portable GPU support. Enable with `-DBASPACHO_USE_OPENCL=1`. (Experimental)
* AMD, from SuiteSparse, can be used instead of Eigen for block reordering algorithm.
* CHOLMOD, from SuiteSparse, used in benchmark as a reference for performance of sparse solvers.

@@ -82,16 +84,59 @@ Show tests:
ctest --test-dir build --show-only
```

### CUDA
CUDA is enabled via the BASPACHO_USE_CUBLAS option (on by default); add
`-DBASPACHO_USE_CUBLAS=0` to disable it.
You may need to add `-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc` so the
build can find the CUDA compiler.
The CUDA architectures can be specified with e.g. `-DBASPACHO_CUDA_ARCHS="60;70;75"`,
which also supports the options 'detect' (default) which detects the installed GPU arch,
and 'torch' which fills in the architectures supported by PyTorch and >=60 (see below).

### Metal (Apple Silicon)
Metal support is available for macOS with Apple Silicon (M1/M2/M3/M4). To enable:
```
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0 -DBASPACHO_USE_METAL=1 -DBLA_VENDOR=Apple
```

**Important: The Metal backend only supports single-precision (float) operations.**
Apple Silicon GPUs lack native double-precision FP64 support. Attempting to use
double precision with the Metal backend will result in a runtime error. For double
precision, use the CPU backend (`BackendFast`) or CUDA (`BackendCuda`).

The Metal backend uses:
- Custom Metal compute shaders for sparse operations (factor_lumps, sparse_elim, assemble)
- Metal Performance Shaders (MPS) for dense matrix multiply on large matrices
- Eigen/Accelerate for Cholesky factorization (potrf) and triangular solve (trsm)

### Backend Selection
BaSpaCho supports automatic backend selection with `BackendAuto`:
```cpp
Settings settings;
settings.backend = BackendAuto; // Auto-detect best backend
auto solver = createSolver<float>(paramSize, structure, settings);
```

The detection priority is: CUDA > Metal > OpenCL > CPU (BLAS).

You can also use `detectBestBackend()` to query the recommended backend at runtime.
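For example, a hedged sketch of combining the two (assuming `detectBestBackend()` returns a value assignable to `settings.backend`):

```cpp
Settings settings;
settings.backend = detectBestBackend();  // pick the best available backend
auto solver = createSolver<float>(paramSize, structure, settings);
```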

### OpenCL (Experimental)
OpenCL support provides a portable GPU backend using CLBlast for BLAS operations.
To enable:
```
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBASPACHO_USE_CUBLAS=0 -DBASPACHO_USE_OPENCL=1
```

**Requirements:**
- OpenCL 1.2+ runtime and development headers
- [CLBlast](https://github.com/CNugteren/CLBlast) library

**Note:** The OpenCL backend is experimental. It provides infrastructure for portable GPU
acceleration but currently uses CPU fallbacks for most operations. For production use,
prefer CUDA (NVIDIA) or Metal (Apple Silicon) backends.

### BLAS
The library used is specified in the CMake variable BLA_VENDOR,
a few possibilities are:
* ATLAS
@@ -187,11 +232,13 @@ because matrices naturally have a block-structure depending on parameters of dim
cuda-kernel operations are designed around blocks so it's not ideal if you have some huge parameter blocks,
the library will work best when the parameter blocks have sizes 1 to 12 (in a factor graph, generally you
have many parameter blocks of the same type).
* **CUDA determinism**: assuming BLAS is deterministic, BaSpaCho will be 100% deterministic on the CPU, but
not on CUDA if there is any "sparse elimination" set of parameters, because both factor and solve operations
use atomic addition for parallelism on the GPU. Also, a CUDA architecture >=6 is needed for atomicAdd
on double numbers (this is the compute hardware architecture and not the version of CUDA, arch >=6 means
you need Tesla P100 or GTX 1080-family, or newer. See
[CUDA Architectures](https://en.wikipedia.org/wiki/CUDA#GPUs_supported)).
Otherwise you will have to add the define `CUDA_DOUBLE_ATOMIC_ADD_WORKAROUND` to enable the workaround
in `CudaAtomic.cuh`.
* **Metal precision**: The Metal backend only supports single-precision (float) due to Apple Silicon's
limited FP64 support. Use `BackendFast` (CPU) or `BackendCuda` for double precision requirements.
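One possible way to pass the `CUDA_DOUBLE_ATOMIC_ADD_WORKAROUND` define mentioned above at configure time (a hypothetical invocation; adjust to your setup):

```
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CUDA_FLAGS="-DCUDA_DOUBLE_ATOMIC_ADD_WORKAROUND"
```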