
feat: add simdgroup-optimized Metal kernels#172

Open
cluster2600 wants to merge 26 commits into alibaba:main from cluster2600:feat/metal-simdgroup-kernels

Conversation

@cluster2600
Contributor

@cluster2600 cluster2600 commented Feb 25, 2026

Summary

Adds 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes — no shared memory barriers needed.

Follow-up to #166 ("Future Work: SIMD optimization").

New kernels

| Kernel | Description |
| --- | --- |
| metal_l2_distance_simdgroup | 32-thread cooperative L2 distance |
| metal_inner_product_simdgroup | 32-thread cooperative dot product |
| metal_cosine_similarity_simdgroup | Normalized inner product with 3 parallel reductions |
| metal_topk_simdgroup | Per-query top-k selection via simd_min lane voting |
| metal_matmul_tiled | Tiled matmul with threadgroup shared memory (16×16 tiles) |
| metal_normalize_simdgroup | In-place L2 normalization |
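For metal_matmul_tiled, 16×16 tiling is the standard blocked-GEMM pattern: each threadgroup stages one tile of each operand and accumulates partial products. A minimal numpy sketch of that loop structure (a reference emulation, not the Metal kernel itself):

```python
import numpy as np

TILE = 16  # matches the 16x16 threadgroup tiles

def matmul_tiled_ref(A, B):
    """Blocked matmul: accumulate TILE x TILE products, mirroring shared-memory tiling."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                # one threadgroup would stage these A/B tiles in threadgroup memory here
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
    return C
```

numpy slicing handles ragged edge tiles automatically; the Metal kernel needs explicit bounds checks for the same case.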

Dispatch model

  • Threadgroup size: 32 (one simdgroup)
  • Grid: (n_database, n_queries) threadgroups
  • Each simdgroup collaborates on one (query, database) pair, splitting dimensions across lanes
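The lane-splitting semantics can be sketched as a numpy reference emulation (function name and structure are illustrative; the real reduction is the hardware simd_sum in Metal):

```python
import numpy as np

SIMD_WIDTH = 32  # one simdgroup

def l2_distance_simdgroup_ref(query, vec):
    """Emulate 32 lanes striding over dimensions, then simd_sum combining partials."""
    partials = np.zeros(SIMD_WIDTH)
    for lane in range(SIMD_WIDTH):
        # lane handles dimensions lane, lane+32, lane+64, ...
        d = query[lane::SIMD_WIDTH] - vec[lane::SIMD_WIDTH]
        partials[lane] = np.dot(d, d)
    return partials.sum()  # stands in for simd_sum across the simdgroup

q = np.arange(4, dtype=np.float32)
v = np.zeros(4, dtype=np.float32)
# matches plain L2: 0^2 + 1^2 + 2^2 + 3^2 = 14
```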

Fixes to existing kernels

  • Replace simd_make_float4 with float4 constructor (MSL compliance)
  • Add device address space qualifiers in metal_l2_distance_batch

Merge order

This PR shares a common base with #173, #175, #176. Recommended merge order: #172 → #173 → #175 → #176. Merging any one brings in the shared base commits; the rest then apply cleanly.

Test plan

  • Compiles cleanly with metal -std=metal3.1 -W -Werror on macOS with Xcode Metal toolchain
  • Integration test with Metal compute pipeline on Apple Silicon

- backends/detect.py: Hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: Search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: Benchmarks

Internal sprint work - not for upstream PR.
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x latency with dynamic batching, 8x throughput
- Both support fallback when cuVS not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t

2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared memory version

3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected

2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration

3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection

2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search

3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation

2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout

3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
Add 6 new Metal compute kernels using simdgroup cooperative intrinsics
(simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions
across 32 SIMD lanes without shared memory barriers:

- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: per-query top-k selection via simd_min
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Also fixes existing kernels:
- Replace simd_make_float4 with float4 constructor (MSL compliance)
- Add device address space qualifiers in batch kernel

Tested: compiles cleanly with metal -std=metal3.1 -W -Werror.
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and
  cagra.search(SearchParams, index, queries, k) instead of
  the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path
  (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays
  before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Add CUVS_AVAILABLE and CPP_CUVS_AVAILABLE flags to detect.py.
Update get_optimal_backend() priority chain:
  C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
@cluster2600
Contributor Author

Discussion issue opened: #177 — feedback welcome before review.

@Cuiyus
Collaborator

Cuiyus commented Feb 27, 2026

@greptile

@greptile-apps

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR adds hardware-accelerated Metal compute kernels for Apple Silicon, introducing 6 new simdgroup-optimized kernels that leverage cooperative SIMD intrinsics (simd_sum, simd_min, simd_shuffle) for efficient vector operations without shared memory barriers.

Key Changes

  • New Metal kernels (distance.metal): L2 distance, inner product, cosine similarity, top-k selection, tiled matmul, and normalization using 32-thread simdgroups
  • Backend detection system (detect.py): Prioritizes C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU
  • cuVS integration: Python wrappers for CAGRA (graph ANN) and IVF-PQ with correct RAPIDS API usage
  • Apple Silicon support (apple_silicon.py): PyTorch MPS and Accelerate framework integration
  • CMake modernization: Added CUDA architecture flags, Metal support, and optional cuVS integration
  • Comprehensive testing: New test suite and GPU benchmark notebooks

Issues Found

  • metal_topk_simdgroup kernel has O(k²) complexity due to repeated scans of already-selected indices (lines 419-421)

The implementation follows proper Metal Shading Language conventions and correctly uses simdgroup cooperative reductions for hardware acceleration.

Confidence Score: 4/5

  • This PR is safe to merge with proper testing on Apple Silicon hardware
  • The implementation follows Metal best practices and uses correct API patterns for cuVS integration. The only concern is the O(k²) topk selection algorithm which affects performance but not correctness. Integration testing on actual Apple Silicon hardware is needed to verify Metal kernel compilation and runtime behavior.
  • Pay attention to src/ailego/gpu/metal/distance.metal for the topk kernel performance characteristics

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/ailego/gpu/metal/distance.metal | New Metal kernels with simdgroup intrinsics for L2, dot product, cosine, matmul, topk, and normalize. Topk has O(k²) selection overhead. |
| python/zvec/backends/detect.py | Hardware detection with backend priority: C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy |
| python/zvec/backends/apple_silicon.py | Apple Silicon backend using PyTorch MPS or Accelerate framework for matrix ops and L2 distance |
| python/zvec/backends/cuvs_cagra.py | cuVS CAGRA wrapper using correct RAPIDS API (metric="sqeuclidean"), with CuPy array conversions |
| python/zvec/backends/cuvs_ivf_pq.py | cuVS IVF-PQ wrapper with proper parameter passing and CuPy device array handling |
| src/ailego/gpu/cuvs/zvec_cuvs.h | C++ cuVS bindings header defining IVF-PQ, CAGRA, and HNSW index interfaces with parameters |
| src/CMakeLists.txt | CMake config rewritten to support CUDA, Metal, and optional cuVS with proper architecture flags |
| python/tests/test_backends.py | New test suite covering hardware detection, GPU index operations, and quantization (PQ/OPQ) |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Vector Search Request] --> Detect[Hardware Detection]
    Detect --> CheckCppCuvs{C++ cuVS<br/>Available?}
    CheckCppCuvs -->|Yes| UseCppCuvs[Use C++ cuVS<br/>CAGRA/IVF-PQ]
    CheckCppCuvs -->|No| CheckPyCuvs{Python cuVS<br/>Available?}
    CheckPyCuvs -->|Yes| UsePyCuvs[Use Python cuVS<br/>CuPy arrays]
    CheckPyCuvs -->|No| CheckFaissGpu{FAISS GPU +<br/>NVIDIA GPU?}
    CheckFaissGpu -->|Yes| UseFaissGpu[Use FAISS GPU]
    CheckFaissGpu -->|No| CheckMps{Apple Silicon<br/>MPS?}
    CheckMps -->|Yes| UseMps[Use Metal Kernels<br/>simdgroup ops]
    CheckMps -->|No| CheckFaissCpu{FAISS CPU?}
    CheckFaissCpu -->|Yes| UseFaissCpu[Use FAISS CPU<br/>+ Accelerate]
    CheckFaissCpu -->|No| UseNumpy[Fallback to NumPy]
    
    UseCppCuvs --> Execute[Execute Search]
    UsePyCuvs --> Execute
    UseFaissGpu --> Execute
    UseMps --> MetalKernels[Metal Kernels:<br/>L2/cosine/topk]
    MetalKernels --> Execute
    UseFaissCpu --> Execute
    UseNumpy --> Execute
    
    Execute --> Results[Return distances<br/>+ indices]
    
    style UseMps fill:#a8dadc
    style MetalKernels fill:#a8dadc
    style UseCppCuvs fill:#f1faee
    style UsePyCuvs fill:#f1faee

Last reviewed commit: b08a835


@greptile-apps greptile-apps bot left a comment


41 files reviewed, 1 comment


Comment on lines +419 to +421
for (uint prev = 0; prev < ki; prev++) {
    if (out_i[prev] == i) { already = true; break; }
}

O(k²) complexity checking already-selected indices. For each of k rounds, and for each candidate, this checks up to k previous selections. Consider using threadgroup shared memory for a bitmask or selected indices array to avoid repeated scans.

Suggested change
for (uint prev = 0; prev < ki; prev++) {
if (out_i[prev] == i) { already = true; break; }
}
// Use threadgroup shared memory for selected indices (requires changes to kernel signature)
// threadgroup uint selected_mask[MAX_DATABASE / 32];
// Check: bool already = (selected_mask[i / 32] & (1u << (i % 32))) != 0;
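The linear-rescan-vs-bitmask trade-off the bot describes is easy to verify outside Metal. A plain-Python sketch (illustrative only; function names are hypothetical and this is not MSL):

```python
def topk_linear_scan(dists, k):
    """Selection as in the kernel: rescan already-chosen indices each round, O(k) per check."""
    out = []
    for _ in range(k):
        best_i, best_d = -1, float("inf")
        for i, d in enumerate(dists):
            if i in out:            # linear rescan of selected indices
                continue
            if d < best_d:
                best_i, best_d = i, d
        out.append(best_i)
    return out

def topk_bitmask(dists, k):
    """Same selection with an O(1) membership test via a bitmask, as the review suggests."""
    mask = 0
    out = []
    for _ in range(k):
        best_i, best_d = -1, float("inf")
        for i, d in enumerate(dists):
            if (mask >> i) & 1:     # selected_mask check
                continue
            if d < best_d:
                best_i, best_d = i, d
        mask |= 1 << best_i
        out.append(best_i)
    return out
```

Both produce identical results; only the cost of the "already selected?" test differs, which is what makes the threadgroup bitmask attractive for larger k.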

