feat: add simdgroup-optimized Metal kernels #172
cluster2600 wants to merge 26 commits into alibaba:main
Conversation
- backends/detect.py: Hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: Search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: Benchmarks

Internal sprint work - not for upstream PR.
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
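The scatter-gather flow in the commit above can be sketched in a few lines of pure Python. This is an illustrative toy, not the PR's actual ShardManager/DistributedIndex API: `hash_shard`, `scatter_gather`, and `shard_search` are hypothetical names, and each shard is just a dict of id to vector.

```python
import heapq

def hash_shard(vector_id, n_shards):
    # Hash-based routing: assign each vector id to a shard deterministically.
    return hash(vector_id) % n_shards

def shard_search(shard, query, k):
    # Brute-force squared-L2 scan within one shard (a stand-in for a real index).
    dists = [(sum((a - b) ** 2 for a, b in zip(vec, query)), vid)
             for vid, vec in shard.items()]
    return heapq.nsmallest(k, dists)

def scatter_gather(shards, query, k):
    """Query every shard (scatter), then merge the per-shard top-k (gather)."""
    partial = []
    for shard in shards:
        partial.extend(shard_search(shard, query, k))
    # Global merge: keep the k smallest (distance, id) pairs across all shards.
    return heapq.nsmallest(k, partial)
```

The key property is that merging each shard's local top-k is enough to recover the exact global top-k, since any global winner must be a local winner in its own shard.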
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
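For readers unfamiliar with the codebooks this commit fixes: Product Quantization splits a vector into M subvectors and replaces each with the id of its nearest sub-codebook centroid. A minimal pure-Python sketch (toy codebooks, not the backend's trained ones; `pq_encode`/`pq_decode` are illustrative names):

```python
def pq_encode(vec, codebooks):
    """Encode a vector as one centroid id per subspace.

    codebooks: list of M sub-codebooks (shape M x K x d_sub);
    vec is split into M equal subvectors.
    """
    m = len(codebooks)
    d_sub = len(vec) // m
    codes = []
    for j, cb in enumerate(codebooks):
        sub = vec[j * d_sub:(j + 1) * d_sub]
        # Nearest centroid in this subspace by squared L2 distance.
        best = min(range(len(cb)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(cb[c], sub)))
        codes.append(best)
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximation by concatenating the chosen centroids."""
    out = []
    for j, c in enumerate(codes):
        out.extend(codebooks[j][c])
    return out
```

The "codebooks shape" fix in the commit presumably concerns the M x K x d_sub layout above, which every encode/decode path has to agree on.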
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x lower latency with dynamic batching, 8x throughput
- Both support fallback when cuVS not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
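Synthetic clustered data of the kind this benchmark commit mentions can be generated without any dependencies: pick random centroids, then scatter points around them with Gaussian noise. A sketch under assumed parameters (the function name and the round-robin cluster assignment are illustrative, not the benchmark's actual generator):

```python
import random

def make_clustered(n, dim, n_clusters, spread=0.05, seed=0):
    """Generate n dim-dimensional vectors grouped around n_clusters centroids."""
    rng = random.Random(seed)
    centroids = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    data, labels = [], []
    for i in range(n):
        c = i % n_clusters  # round-robin cluster assignment for simplicity
        # Sample a point near centroid c with Gaussian noise per coordinate.
        data.append([x + rng.gauss(0.0, spread) for x in centroids[c]])
        labels.append(c)
    return data, labels
```

Clustered (rather than uniform) data matters for ANN benchmarks because index structures like IVF-PQ exploit cluster structure, so uniform data understates their real-world recall/speed tradeoff.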
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t
2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared memory version
3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected
2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration
3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
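At the core of graph-based ANN like the CAGRA-style index above is a greedy best-first walk: start at an entry node and repeatedly hop to whichever neighbor is closest to the query. A stripped-down pure-Python sketch, omitting the beam/priority-queue search and NN-Descent construction that graph_ann.h would actually use (function names are illustrative):

```python
def l2(a, b):
    # Squared L2 distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_graph_search(graph, vectors, query, entry, dist=l2):
    """Greedy walk: from `entry`, move to the neighbor closest to the
    query; stop when no neighbor improves on the current node."""
    current = entry
    best = dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            d = dist(vectors[nb], query)
            if d < best:
                best, current, improved = d, nb, True
    return current, best
```

Real implementations keep a beam of candidates instead of a single node to avoid getting stuck in local minima, which is exactly what hierarchical search layers mitigate.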
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection
2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search
3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation
2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout
3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
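The "PQ distance tables" mentioned for batch.h refer to the standard asymmetric-distance trick: per query, precompute the distance from each query subvector to every centroid, so scoring one encoded vector is just M table lookups and adds instead of M full subvector distances. A minimal pure-Python sketch (toy codebooks; function names are illustrative, not batch.h's API):

```python
def build_distance_table(query, codebooks):
    """table[j][c] = squared L2 distance from the j-th query subvector
    to centroid c of sub-codebook j. Built once per query."""
    m = len(codebooks)
    d_sub = len(query) // m
    table = []
    for j, cb in enumerate(codebooks):
        sub = query[j * d_sub:(j + 1) * d_sub]
        table.append([sum((a - b) ** 2 for a, b in zip(cent, sub))
                      for cent in cb])
    return table

def adc_distance(codes, table):
    # Asymmetric distance computation: one table lookup per subspace.
    return sum(table[j][c] for j, c in enumerate(codes))
```

The SIMD win comes from laying these tables out so a vectorized gather/shuffle can fetch many entries at once, which is what FastScan-style kernels and the transposed-matrix layout in the commit exploit.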
Add 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes without shared memory barriers:
- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: per-query top-k selection via simd_min
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Also fixes existing kernels:
- Replace simd_make_float4 with float4 constructor (MSL compliance)
- Add device address space qualifiers in batch kernel

Tested: compiles cleanly with metal -std=metal3.1 -W -Werror.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
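For intuition about what a simdgroup reduction does: intrinsics like simd_sum can be modelled as log2(width) butterfly steps, where each lane combines its value with the lane whose index differs in one bit (the simd_shuffle_xor pattern). A pure-Python simulation of the lane-level data flow, purely illustrative of the hardware behavior:

```python
def simulate_simd_sum(lanes):
    """Butterfly reduction over a simdgroup: after log2(width) rounds of
    pairwise exchange-and-add, every lane holds the total sum, with no
    shared memory or barriers needed (all traffic stays in registers)."""
    width = len(lanes)  # must be a power of two; 32 on Apple GPUs
    vals = list(lanes)
    step = 1
    while step < width:
        # Each lane i adds the value from partner lane i XOR step.
        vals = [vals[i] + vals[i ^ step] for i in range(width)]
        step *= 2
    return vals
```

This is why the new kernels can reduce across 32 lanes without threadgroup barriers: the exchange happens inside the SIMD unit, unlike a shared-memory tree reduction which needs a barrier per level.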
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and cagra.search(SearchParams, index, queries, k) instead of the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Add CUVS_AVAILABLE and CPP_CUVS_AVAILABLE flags to detect.py.

Update get_optimal_backend() priority chain:
C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
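The priority chain reduces to "return the first available backend in a fixed order". A minimal sketch of that selection logic, assuming availability flags are passed in as a dict (the flag names and function shape here are illustrative; detect.py's actual module-level constants may differ):

```python
def get_optimal_backend(flags):
    """Pick the highest-priority backend whose availability flag is set.

    `flags` maps backend name to availability, e.g. derived from
    CPP_CUVS_AVAILABLE / CUVS_AVAILABLE style detection checks.
    """
    priority = ["cpp_cuvs", "cuvs", "faiss_gpu", "mps", "faiss_cpu", "numpy"]
    for backend in priority:
        if flags.get(backend, False):
            return backend
    return "numpy"  # NumPy is the unconditional fallback
```

Keeping the order in one list makes the chain easy to audit against the flowchart below the review summary.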
Discussion issue opened: #177 — feedback welcome before review.
@greptile |
Greptile Summary

This PR adds hardware-accelerated Metal compute kernels for Apple Silicon, introducing 6 new simdgroup-optimized kernels that leverage cooperative SIMD intrinsics (`simd_sum`, `simd_min`, `simd_shuffle`).

Key Changes
Issues Found
The implementation follows proper Metal Shading Language conventions and correctly uses simdgroup cooperative reductions for hardware acceleration.

Confidence Score: 4/5
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
Start[Vector Search Request] --> Detect[Hardware Detection]
Detect --> CheckCppCuvs{C++ cuVS<br/>Available?}
CheckCppCuvs -->|Yes| UseCppCuvs[Use C++ cuVS<br/>CAGRA/IVF-PQ]
CheckCppCuvs -->|No| CheckPyCuvs{Python cuVS<br/>Available?}
CheckPyCuvs -->|Yes| UsePyCuvs[Use Python cuVS<br/>CuPy arrays]
CheckPyCuvs -->|No| CheckFaissGpu{FAISS GPU +<br/>NVIDIA GPU?}
CheckFaissGpu -->|Yes| UseFaissGpu[Use FAISS GPU]
CheckFaissGpu -->|No| CheckMps{Apple Silicon<br/>MPS?}
CheckMps -->|Yes| UseMps[Use Metal Kernels<br/>simdgroup ops]
CheckMps -->|No| CheckFaissCpu{FAISS CPU?}
CheckFaissCpu -->|Yes| UseFaissCpu[Use FAISS CPU<br/>+ Accelerate]
CheckFaissCpu -->|No| UseNumpy[Fallback to NumPy]
UseCppCuvs --> Execute[Execute Search]
UsePyCuvs --> Execute
UseFaissGpu --> Execute
UseMps --> MetalKernels[Metal Kernels:<br/>L2/cosine/topk]
MetalKernels --> Execute
UseFaissCpu --> Execute
UseNumpy --> Execute
Execute --> Results[Return distances<br/>+ indices]
style UseMps fill:#a8dadc
style MetalKernels fill:#a8dadc
style UseCppCuvs fill:#f1faee
style UsePyCuvs fill:#f1faee
Last reviewed commit: b08a835
for (uint prev = 0; prev < ki; prev++) {
    if (out_i[prev] == i) { already = true; break; }
}
O(k²) complexity checking already-selected indices. For each of k rounds, and for each candidate, this checks up to k previous selections. Consider using threadgroup shared memory for a bitmask or selected indices array to avoid repeated scans.
Suggested change:

for (uint prev = 0; prev < ki; prev++) {
    if (out_i[prev] == i) { already = true; break; }
}

// Use threadgroup shared memory for selected indices (requires changes to kernel signature)
// threadgroup uint selected_mask[MAX_DATABASE / 32];
// Check: bool already = (selected_mask[i / 32] & (1u << (i % 32))) != 0;
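The point of the suggestion is that a packed bitmask turns the per-candidate membership check from an O(k) rescan into an O(1) bit test. A host-side Python sketch of both bookkeeping strategies (illustrative, not the kernel code) shows they select identical indices:

```python
def topk_with_rescan(dists, k):
    """Round-based top-k: each round takes the global min among indices
    not yet selected, checking membership by scanning the output list
    (the O(k^2) pattern flagged in the review)."""
    out = []
    for _ in range(k):
        best = min((d, i) for i, d in enumerate(dists) if i not in out)
        out.append(best[1])
    return out

def topk_with_bitmask(dists, k):
    """Same selection, but membership via a packed bitmask, mirroring
    the suggested threadgroup selected_mask: one bit per candidate."""
    mask = [0] * ((len(dists) + 31) // 32)
    out = []
    for _ in range(k):
        best = min((d, i) for i, d in enumerate(dists)
                   if not (mask[i // 32] >> (i % 32)) & 1)
        out.append(best[1])
        mask[best[1] // 32] |= 1 << (best[1] % 32)
    return out
```

In the Metal kernel the mask would live in threadgroup memory so all lanes share it, which is why the reviewer notes a kernel-signature change is needed.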
Summary
Adds 6 new Metal compute kernels using simdgroup cooperative intrinsics (`simd_sum`, `simd_min`, `simd_shuffle`) for hardware-accelerated reductions across 32 SIMD lanes — no shared memory barriers needed.

Follow-up to #166 ("Future Work: SIMD optimization").
New kernels
- `metal_l2_distance_simdgroup`: cooperative L2 distance
- `metal_inner_product_simdgroup`: cooperative dot product
- `metal_cosine_similarity_simdgroup`: normalized inner product
- `metal_topk_simdgroup`: per-query top-k selection via `simd_min` lane voting
- `metal_matmul_tiled`: tiled matmul with threadgroup shared memory
- `metal_normalize_simdgroup`: in-place L2 normalization

Dispatch model

Grid dispatched as (n_database, n_queries) threadgroups.

Fixes to existing kernels

- `simd_make_float4` → `float4` constructor (MSL compliance)
- Add `device` address space qualifiers in `metal_l2_distance_batch`

Merge order
Test plan
Compiles cleanly with `metal -std=metal3.1 -W -Werror` on macOS with the Xcode Metal toolchain.