Motivation
The existing Metal kernels in zvec use basic threadgroup reductions that go through threadgroup (shared) memory and barriers. Apple's SIMD-group intrinsics (`simd_sum`, `simd_min`, `simd_shuffle`) perform hardware-level reductions across the 32 lanes of a SIMD-group without touching threadgroup memory, which should be faster on all Apple GPU families >= Apple4.
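To make the reduction pattern concrete, here is a minimal CPU-side sketch in C++ that emulates what a `simd_sum`-style reduction computes for an L2 distance. It is not Metal code and not the draft implementation: the names `kSimdWidth`, `emulated_simd_sum`, and `l2_squared_cooperative` are hypothetical, and the 32-lane width is taken from the description above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed SIMD-group width (32 lanes, as described in the motivation).
constexpr std::size_t kSimdWidth = 32;

// Emulates a simd_sum-style tree reduction: at each step, lane i adds the
// value held by lane i + offset, and the offset halves until lane 0 holds
// the total. On the GPU each step would be a shuffle between lanes.
float emulated_simd_sum(std::array<float, kSimdWidth> lanes) {
    for (std::size_t offset = kSimdWidth / 2; offset > 0; offset /= 2) {
        for (std::size_t lane = 0; lane < offset; ++lane) {
            lanes[lane] += lanes[lane + offset];
        }
    }
    return lanes[0];
}

// Squared L2 distance where "lane" i accumulates elements i, i + 32, i + 64, ...
// and the per-lane partial sums are then combined by the emulated reduction.
float l2_squared_cooperative(const std::vector<float>& a,
                             const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::array<float, kSimdWidth> partial{};  // zero-initialized partial sums
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        partial[i % kSimdWidth] += d * d;
    }
    return emulated_simd_sum(partial);
}
```

A function like this could also serve as a reference result when checking a GPU kernel's output, keeping in mind that floating-point summation order differs between the two, so comparisons would likely need a tolerance.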
Proposed approach
Add 6 new Metal kernels under src/ailego/gpu/metal/:
| Kernel | Purpose |
|---|---|
| `metal_l2_distance_simdgroup` | 32-thread cooperative L2 distance |
| `metal_inner_product_simdgroup` | 32-thread cooperative dot product |
| `metal_cosine_simdgroup` | Cosine similarity with joint norm computation |
| `metal_topk_simdgroup` | Parallel top-k selection using simd min-reduce |
| `metal_batch_distance_simdgroup` | Fused batch distance for multi-query search |
| `metal_fused_knn_simdgroup` | Full fused KNN (distance + top-k in one dispatch) |
All kernels would be standalone .metal files with a C++ test. No Python integration in this step — that would come separately.
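As an illustration of what the C++ test could compare against, below is a sketch of a sequential CPU reference for top-k selection by repeated min-reduction. This is one plausible reading of "top-k selection using simd min-reduce", not the draft's algorithm: the function name `topk_by_min_reduce` is hypothetical, and the inner loop stands in for what a `simd_min` reduction would do across lanes.

```cpp
#include <cstddef>
#include <vector>

// Returns the indices of the k smallest distances, ordered from smallest to
// largest. Each round performs one min-reduction over the candidates that
// have not been selected yet; ties resolve to the lowest index.
std::vector<std::size_t> topk_by_min_reduce(const std::vector<float>& dist,
                                            std::size_t k) {
    std::vector<bool> taken(dist.size(), false);
    std::vector<std::size_t> result;
    for (std::size_t round = 0; round < k && round < dist.size(); ++round) {
        std::size_t best = dist.size();  // sentinel: no candidate chosen yet
        for (std::size_t i = 0; i < dist.size(); ++i) {
            if (taken[i]) continue;
            if (best == dist.size() || dist[i] < dist[best]) best = i;
        }
        taken[best] = true;  // exclude the winner from later rounds
        result.push_back(best);
    }
    return result;
}
```

This costs O(n·k) on the CPU, which is acceptable for a test oracle; the ordering and tie-breaking rule would need to match whatever the GPU kernel guarantees.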
Questions for maintainers
- Is this the right directory (`src/ailego/gpu/metal/`)? I see existing Metal files there.
- Should these replace the current kernels or live alongside them?
- Any preference on Metal Shading Language version (I'm targeting MSL 2.4+)?
- Per your feedback on #166 ("feat: add GPU backends, quantization, and search optimizations"), the Metal backend needs "more integration logic". Which integration points would be most useful?
Draft implementation: #172