
feat: add simdgroup-optimized Metal kernels#172

Open
cluster2600 wants to merge 26 commits into alibaba:main from cluster2600:feat/metal-simdgroup-kernels

Conversation

@cluster2600
Contributor

@cluster2600 cluster2600 commented Feb 25, 2026

Summary

Adds 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes — no shared memory barriers needed.

Follow-up to #166 ("Future Work: SIMD optimization").

New kernels

| Kernel | Description |
| --- | --- |
| metal_l2_distance_simdgroup | 32-thread cooperative L2 distance |
| metal_inner_product_simdgroup | 32-thread cooperative dot product |
| metal_cosine_similarity_simdgroup | Normalized inner product with 3 parallel reductions |
| metal_topk_simdgroup | Per-query top-k selection via simd_min lane voting |
| metal_matmul_tiled | Tiled matmul with threadgroup shared memory (16×16 tiles) |
| metal_normalize_simdgroup | In-place L2 normalization |
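For metal_matmul_tiled, 16×16 tiling is the standard blocked-GEMM pattern: each threadgroup stages one tile of each operand and accumulates partial products. A minimal numpy sketch of that loop structure (a reference emulation, not the Metal kernel itself):

```python
import numpy as np

TILE = 16  # matches the 16x16 threadgroup tiles

def matmul_tiled_ref(A, B):
    """Blocked matmul: accumulate TILE x TILE products, mirroring shared-memory tiling."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                # one threadgroup would stage these A/B tiles in threadgroup memory here
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
    return C
```

numpy slicing handles ragged edge tiles automatically; the Metal kernel needs explicit bounds checks for the same case.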

Dispatch model

  • Threadgroup size: 32 (one simdgroup)
  • Grid: (n_database, n_queries) threadgroups
  • Each simdgroup collaborates on one (query, database) pair, splitting dimensions across lanes
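The lane-splitting semantics can be sketched as a numpy reference emulation (function name and structure are illustrative; the real reduction is the hardware simd_sum in Metal):

```python
import numpy as np

SIMD_WIDTH = 32  # one simdgroup

def l2_distance_simdgroup_ref(query, vec):
    """Emulate 32 lanes striding over dimensions, then simd_sum combining partials."""
    partials = np.zeros(SIMD_WIDTH)
    for lane in range(SIMD_WIDTH):
        # lane handles dimensions lane, lane+32, lane+64, ...
        d = query[lane::SIMD_WIDTH] - vec[lane::SIMD_WIDTH]
        partials[lane] = np.dot(d, d)
    return partials.sum()  # stands in for simd_sum across the simdgroup

q = np.arange(4, dtype=np.float32)
v = np.zeros(4, dtype=np.float32)
# matches plain L2: 0^2 + 1^2 + 2^2 + 3^2 = 14
```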

Fixes to existing kernels

  • Replace simd_make_float4 with float4 constructor (MSL compliance)
  • Add device address space qualifiers in metal_l2_distance_batch

Merge order

This PR shares a common base with #173, #175, #176. Recommended merge order: #172 → #173 → #175 → #176. Merging any one brings in the shared base commits; the rest then apply cleanly.

Test plan

  • Compiles cleanly with metal -std=metal3.1 -W -Werror on macOS with Xcode Metal toolchain
  • Integration test with Metal compute pipeline on Apple Silicon

- backends/detect.py: Hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: Search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: Benchmarks

Internal sprint work - not for upstream PR.
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x latency with dynamic batching, 8x throughput
- Both support fallback when cuVS not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t

2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared memory version

3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected

2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration

3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection

2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search

3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation

2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout

3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
Add 6 new Metal compute kernels using simdgroup cooperative intrinsics
(simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions
across 32 SIMD lanes without shared memory barriers:

- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: per-query top-k selection via simd_min
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Also fixes existing kernels:
- Replace simd_make_float4 with float4 constructor (MSL compliance)
- Add device address space qualifiers in batch kernel

Tested: compiles cleanly with metal -std=metal3.1 -W -Werror.
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and
  cagra.search(SearchParams, index, queries, k) instead of
  the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path
  (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays
  before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Add CUVS_AVAILABLE and CPP_CUVS_AVAILABLE flags to detect.py.
Update get_optimal_backend() priority chain:
  C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
@cluster2600
Contributor Author

Discussion issue opened: #177 — feedback welcome before review.

@Cuiyus
Collaborator

Cuiyus commented Feb 27, 2026

@greptile

@greptile-apps

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR adds hardware-accelerated Metal compute kernels for Apple Silicon, introducing 6 new simdgroup-optimized kernels that leverage cooperative SIMD intrinsics (simd_sum, simd_min, simd_shuffle) for efficient vector operations without shared memory barriers.

Key Changes

  • New Metal kernels (distance.metal): L2 distance, inner product, cosine similarity, top-k selection, tiled matmul, and normalization using 32-thread simdgroups
  • Backend detection system (detect.py): Prioritizes C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU
  • cuVS integration: Python wrappers for CAGRA (graph ANN) and IVF-PQ with correct RAPIDS API usage
  • Apple Silicon support (apple_silicon.py): PyTorch MPS and Accelerate framework integration
  • CMake modernization: Added CUDA architecture flags, Metal support, and optional cuVS integration
  • Comprehensive testing: New test suite and GPU benchmark notebooks

Issues Found

  • metal_topk_simdgroup kernel has O(k²) complexity due to repeated scans of already-selected indices (lines 419-421)

The implementation follows proper Metal Shading Language conventions and correctly uses simdgroup cooperative reductions for hardware acceleration.

Confidence Score: 4/5

  • This PR is safe to merge with proper testing on Apple Silicon hardware
  • The implementation follows Metal best practices and uses correct API patterns for cuVS integration. The only concern is the O(k²) topk selection algorithm which affects performance but not correctness. Integration testing on actual Apple Silicon hardware is needed to verify Metal kernel compilation and runtime behavior.
  • Pay attention to src/ailego/gpu/metal/distance.metal for the topk kernel performance characteristics

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/ailego/gpu/metal/distance.metal | New Metal kernels with simdgroup intrinsics for L2, dot product, cosine, matmul, topk, and normalize. Topk has O(k²) selection overhead. |
| python/zvec/backends/detect.py | Hardware detection with backend priority: C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy |
| python/zvec/backends/apple_silicon.py | Apple Silicon backend using PyTorch MPS or Accelerate framework for matrix ops and L2 distance |
| python/zvec/backends/cuvs_cagra.py | cuVS CAGRA wrapper using correct RAPIDS API (metric="sqeuclidean"), with CuPy array conversions |
| python/zvec/backends/cuvs_ivf_pq.py | cuVS IVF-PQ wrapper with proper parameter passing and CuPy device array handling |
| src/ailego/gpu/cuvs/zvec_cuvs.h | C++ cuVS bindings header defining IVF-PQ, CAGRA, and HNSW index interfaces with parameters |
| src/CMakeLists.txt | CMake config rewritten to support CUDA, Metal, and optional cuVS with proper architecture flags |
| python/tests/test_backends.py | New test suite covering hardware detection, GPU index operations, and quantization (PQ/OPQ) |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Vector Search Request] --> Detect[Hardware Detection]
    Detect --> CheckCppCuvs{C++ cuVS<br/>Available?}
    CheckCppCuvs -->|Yes| UseCppCuvs[Use C++ cuVS<br/>CAGRA/IVF-PQ]
    CheckCppCuvs -->|No| CheckPyCuvs{Python cuVS<br/>Available?}
    CheckPyCuvs -->|Yes| UsePyCuvs[Use Python cuVS<br/>CuPy arrays]
    CheckPyCuvs -->|No| CheckFaissGpu{FAISS GPU +<br/>NVIDIA GPU?}
    CheckFaissGpu -->|Yes| UseFaissGpu[Use FAISS GPU]
    CheckFaissGpu -->|No| CheckMps{Apple Silicon<br/>MPS?}
    CheckMps -->|Yes| UseMps[Use Metal Kernels<br/>simdgroup ops]
    CheckMps -->|No| CheckFaissCpu{FAISS CPU?}
    CheckFaissCpu -->|Yes| UseFaissCpu[Use FAISS CPU<br/>+ Accelerate]
    CheckFaissCpu -->|No| UseNumpy[Fallback to NumPy]
    
    UseCppCuvs --> Execute[Execute Search]
    UsePyCuvs --> Execute
    UseFaissGpu --> Execute
    UseMps --> MetalKernels[Metal Kernels:<br/>L2/cosine/topk]
    MetalKernels --> Execute
    UseFaissCpu --> Execute
    UseNumpy --> Execute
    
    Execute --> Results[Return distances<br/>+ indices]
    
    style UseMps fill:#a8dadc
    style MetalKernels fill:#a8dadc
    style UseCppCuvs fill:#f1faee
    style UsePyCuvs fill:#f1faee

Last reviewed commit: b08a835


@greptile-apps greptile-apps bot left a comment


41 files reviewed, 1 comment


Comment on lines +419 to +421
for (uint prev = 0; prev < ki; prev++) {
    if (out_i[prev] == i) { already = true; break; }
}

O(k²) complexity checking already-selected indices. For each of k rounds, and for each candidate, this checks up to k previous selections. Consider using threadgroup shared memory for a bitmask or selected indices array to avoid repeated scans.

Suggested change
for (uint prev = 0; prev < ki; prev++) {
if (out_i[prev] == i) { already = true; break; }
}
// Use threadgroup shared memory for selected indices (requires changes to kernel signature)
// threadgroup uint selected_mask[MAX_DATABASE / 32];
// Check: bool already = (selected_mask[i / 32] & (1u << (i % 32))) != 0;
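The linear-rescan-vs-bitmask trade-off the bot describes is easy to verify outside Metal. A plain-Python sketch (illustrative only; function names are hypothetical and this is not MSL):

```python
def topk_linear_scan(dists, k):
    """Selection as in the kernel: rescan already-chosen indices each round, O(k) per check."""
    out = []
    for _ in range(k):
        best_i, best_d = -1, float("inf")
        for i, d in enumerate(dists):
            if i in out:            # linear rescan of selected indices
                continue
            if d < best_d:
                best_i, best_d = i, d
        out.append(best_i)
    return out

def topk_bitmask(dists, k):
    """Same selection with an O(1) membership test via a bitmask, as the review suggests."""
    mask = 0
    out = []
    for _ in range(k):
        best_i, best_d = -1, float("inf")
        for i, d in enumerate(dists):
            if (mask >> i) & 1:     # selected_mask check
                continue
            if d < best_d:
                best_i, best_d = i, d
        mask |= 1 << best_i
        out.append(best_i)
    return out
```

Both produce identical results; only the cost of the "already selected?" test differs, which is what makes the threadgroup bitmask attractive for larger k.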

