Proposal: simdgroup-optimized Metal compute kernels #177

## Motivation

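The existing Metal kernels in zvec use basic threadgroup reductions with threadgroup (shared) memory barriers. Apple's SIMD-group intrinsics (`simd_sum`, `simd_min`, `simd_shuffle`) enable **hardware-level reductions across the 32 lanes of a SIMD-group** without threadgroup memory, which should be faster on every Apple GPU family that supports them. As far as I can tell from the Metal Feature Set Tables, SIMD-scoped reductions (`simd_sum`, `simd_min`) are listed from Apple7 and SIMD-scoped permutes (`simd_shuffle`) from Apple6, so the original ">= Apple4" target is too broad and older families would need a fallback to the existing kernels.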
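To make the reduction pattern concrete, here is a minimal CPU-side sketch in C++ of what a `simd_sum`-style L2 distance does. It is a model, not the kernel from the draft: `emulated_simd_sum`, `l2_squared_simdgroup_model`, and `kSimdWidth` are hypothetical names, and the 32-lane width is an assumption. Each lane accumulates a strided partial sum, then a 5-step xor-shuffle butterfly combines the partials without any shared buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed SIMD-group width; hypothetical constant for this sketch.
constexpr std::size_t kSimdWidth = 32;

// CPU model of a simd_sum-style reduction. After log2(32) = 5 xor-shuffle
// steps, every lane holds the sum of all 32 inputs.
inline std::array<float, kSimdWidth> emulated_simd_sum(
    std::array<float, kSimdWidth> lanes) {
    for (std::size_t offset = kSimdWidth / 2; offset > 0; offset /= 2) {
        std::array<float, kSimdWidth> next{};
        for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
            // Each lane adds the value held by its butterfly partner.
            next[lane] = lanes[lane] + lanes[lane ^ offset];
        }
        lanes = next;
    }
    return lanes;
}

// Squared L2 distance computed the way 32 cooperating lanes would do it:
// lane j handles elements j, j + 32, j + 64, ... and the partial sums are
// then combined by the lane-level reduction above.
inline float l2_squared_simdgroup_model(const std::vector<float>& a,
                                        const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::array<float, kSimdWidth> partial{};
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        for (std::size_t i = lane; i < a.size(); i += kSimdWidth) {
            const float d = a[i] - b[i];
            partial[lane] += d * d;
        }
    }
    return emulated_simd_sum(partial)[0];
}
```

In an actual kernel the inner butterfly would collapse to a single `simd_sum(partial)` call per lane; the explicit loop here only shows why no threadgroup memory or barrier is needed within one SIMD-group.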

## Proposed approach

Add 6 new Metal kernels under `src/ailego/gpu/metal/`:

| Kernel | Purpose |
|--------|---------|
| `metal_l2_distance_simdgroup` | 32-thread cooperative L2 distance |
| `metal_inner_product_simdgroup` | 32-thread cooperative dot product |
| `metal_cosine_simdgroup` | Cosine similarity with joint norm computation |
| `metal_topk_simdgroup` | Parallel top-k selection using simd min-reduce |
| `metal_batch_distance_simdgroup` | Fused batch distance for multi-query search |
| `metal_fused_knn_simdgroup` | Full fused KNN (distance + top-k in one dispatch) |
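
As an illustration of the "top-k selection using simd min-reduce" idea, here is a hedged CPU-side sketch in C++. It assumes one candidate per lane, a 32-lane SIMD-group, and unique indices; `Candidate`, `emulated_simd_min`, and `topk_model` are hypothetical names, and the draft kernel may be organized differently. Each round min-reduces across the lanes, emits the winner, and retires it by setting its distance to infinity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed SIMD-group width; hypothetical constant for this sketch.
constexpr std::size_t kLanes = 32;

struct Candidate {
    float distance;
    std::uint32_t index;
};

// CPU model of a min-reduce over (distance, index) pairs using an
// xor-shuffle butterfly. Ties are broken in favor of the lower index.
inline Candidate emulated_simd_min(std::array<Candidate, kLanes> lanes) {
    for (std::size_t offset = kLanes / 2; offset > 0; offset /= 2) {
        std::array<Candidate, kLanes> next = lanes;
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            const Candidate& mine = lanes[lane];
            const Candidate& other = lanes[lane ^ offset];
            const bool take_other =
                other.distance < mine.distance ||
                (other.distance == mine.distance && other.index < mine.index);
            next[lane] = take_other ? other : mine;
        }
        lanes = next;
    }
    return lanes[0];  // every lane holds the same minimum at this point
}

// Selects the k nearest candidates with k rounds of min-reduce.
// Assumes k <= kLanes, finite input distances, and unique indices.
inline std::vector<Candidate> topk_model(std::array<Candidate, kLanes> lanes,
                                         std::size_t k) {
    assert(k <= kLanes);
    std::vector<Candidate> out;
    out.reserve(k);
    for (std::size_t round = 0; round < k; ++round) {
        const Candidate best = emulated_simd_min(lanes);
        out.push_back(best);
        // The lane that owned the winner retires it for later rounds.
        for (Candidate& c : lanes) {
            if (c.index == best.index) {
                c.distance = std::numeric_limits<float>::infinity();
            }
        }
    }
    return out;
}
```

This costs k reduction rounds, so it only makes sense for small k. On the GPU, `simd_min` reduces scalars, so carrying the index along would need an extra step (for example packing distance and index into one value), which this model glosses over.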

All kernels would be standalone `.metal` files with a C++ test. Python integration is out of scope for this step and would come separately.

## Questions for maintainers

1. Is this the right directory (`src/ailego/gpu/metal/`)? I see existing Metal files there.
2. Should these replace the current kernels or live alongside them?
3. Any preference on Metal Shading Language version (I'm targeting MSL 2.4+)?
4. Per your feedback on #166, the Metal backend needs "more integration logic" — what integration points would be most useful?

Draft implementation: #172
