Proposal: simdgroup-optimized Metal compute kernels #177

@cluster2600

Motivation

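The existing Metal kernels in zvec use basic threadgroup reductions that go through threadgroup (shared) memory and barriers. Apple's simdgroup intrinsics (simd_sum, simd_min, simd_shuffle) reduce across the 32 lanes of a simdgroup in hardware, with no threadgroup memory or barrier needed for that step, which should be faster on every Apple GPU family that supports them. I have assumed that means Apple4 and later, but this should be checked against Apple's Metal Feature Set Tables: SIMD-scoped reductions such as simd_sum may need a newer family than simd_shuffle does.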
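
To make the reduction pattern concrete, here is a minimal CPU-side sketch in C++ (not Metal, and not the code in the draft PR). It emulates a 32-lane butterfly reduction, which is roughly what a simd_sum-style intrinsic does in hardware, and uses it for a squared L2 distance. The names `emulated_simd_sum` and `l2_squared_cooperative` are hypothetical. Returning the squared distance without a square root is an assumption too.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed simdgroup width (32 lanes, as stated in the proposal).
constexpr std::size_t kSimdWidth = 32;

// CPU stand-in for a simd_sum-style reduction: a butterfly over 32 lanes.
// Each step adds the value held by the lane whose index differs in one bit
// (what an XOR shuffle would read). After log2(32) = 5 steps every lane
// holds the full sum, so lane 0 is returned.
inline float emulated_simd_sum(std::array<float, kSimdWidth> lanes) {
  for (std::size_t offset = kSimdWidth / 2; offset > 0; offset /= 2) {
    std::array<float, kSimdWidth> shuffled{};
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
      shuffled[lane] = lanes[lane ^ offset];
    }
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
      lanes[lane] += shuffled[lane];
    }
  }
  return lanes[0];
}

// Hypothetical reference for a cooperative squared L2 distance: lane t
// accumulates elements t, t + 32, t + 64, ... and the per-lane partial
// sums are then combined with the emulated reduction.
inline float l2_squared_cooperative(const std::vector<float>& a,
                                    const std::vector<float>& b) {
  assert(a.size() == b.size());
  std::array<float, kSimdWidth> partial{};
  for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
    for (std::size_t i = lane; i < a.size(); i += kSimdWidth) {
      const float d = a[i] - b[i];
      partial[lane] += d * d;
    }
  }
  return emulated_simd_sum(partial);
}
```

A reference like this could also serve as the expected-value side of the C++ tests, bearing in mind that the summation order differs from a sequential loop, so comparisons on real data would need a tolerance.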

Proposed approach

Add 6 new Metal kernels under src/ailego/gpu/metal/:

| Kernel | Purpose |
| --- | --- |
| metal_l2_distance_simdgroup | 32-thread cooperative L2 distance |
| metal_inner_product_simdgroup | 32-thread cooperative dot product |
| metal_cosine_simdgroup | Cosine similarity with joint norm computation |
| metal_topk_simdgroup | Parallel top-k selection using simd min-reduce |
| metal_batch_distance_simdgroup | Fused batch distance for multi-query search |
| metal_fused_knn_simdgroup | Full fused KNN (distance + top-k in one dispatch) |
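
As an illustration of the top-k idea in the table, the following C++ sketch emulates selection by repeated min-reduce on the CPU. It is an assumption about how such a kernel could be structured, not a description of the draft implementation. Each of 32 lanes scans a strided slice for its best remaining candidate, a butterfly min-reduce picks the global winner, and the process repeats k times. The names `emulated_simd_min` and `topk_smallest` are hypothetical. Reducing a (distance, index) pair in one step is a simplification: a real kernel would likely have to recover the winning index separately. Distances are assumed to be finite.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

constexpr std::size_t kLanes = 32;  // assumed simdgroup width

// (distance, index); std::pair compares lexicographically, so equal
// distances are ordered by the smaller index.
using Candidate = std::pair<float, std::size_t>;

// CPU stand-in for a min-reduce across 32 lanes (butterfly pattern).
inline Candidate emulated_simd_min(std::array<Candidate, kLanes> lanes) {
  for (std::size_t offset = kLanes / 2; offset > 0; offset /= 2) {
    const std::array<Candidate, kLanes> prev = lanes;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      lanes[lane] = std::min(prev[lane], prev[lane ^ offset]);
    }
  }
  return lanes[0];
}

// Returns the indices of the k smallest distances, nearest first.
inline std::vector<std::size_t> topk_smallest(const std::vector<float>& dist,
                                              std::size_t k) {
  assert(k <= dist.size());
  const Candidate none{std::numeric_limits<float>::infinity(),
                       std::numeric_limits<std::size_t>::max()};
  std::vector<bool> taken(dist.size(), false);
  std::vector<std::size_t> result;
  for (std::size_t round = 0; round < k; ++round) {
    // Each lane finds the best not-yet-selected entry in its strided slice.
    std::array<Candidate, kLanes> best;
    best.fill(none);
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      for (std::size_t i = lane; i < dist.size(); i += kLanes) {
        if (!taken[i]) {
          best[lane] = std::min(best[lane], Candidate{dist[i], i});
        }
      }
    }
    // One min-reduce per round selects the global winner.
    const Candidate winner = emulated_simd_min(best);
    taken[winner.second] = true;
    result.push_back(winner.second);
  }
  return result;
}
```

This costs k reduction rounds, which is reasonable for small k. For large k a different selection strategy would probably be preferable.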

All kernels would be standalone .metal files with a C++ test. No Python integration in this step — that would come separately.

Questions for maintainers

  1. Is this the right directory (src/ailego/gpu/metal/)? I see existing Metal files there.
  2. Should these replace the current kernels or live alongside them?
  3. Any preference on Metal Shading Language version (I'm targeting MSL 2.4+)?
  4. Per your feedback on #166 ("feat: add GPU backends, quantization, and search optimizations"), the Metal backend needs "more integration logic". Which integration points would be most useful?

Draft implementation: #172
