Force-pushed `fd871bc` to `fa9b5e4`
Add a minimal viable MPS platform so vLLM can detect and use Apple Silicon GPUs via the Metal Performance Shaders backend. This enables model loading and inference on macOS without CUDA.

New files:
- vllm/platforms/mps.py: MPS platform class (device detection, memory APIs, config validation)
- vllm/v1/attention/backends/mps_attn.py: pure PyTorch attention with paged KV cache (no C++ extensions needed)
- vllm/v1/worker/mps_model_runner.py: MPS model runner extending GPUModelRunner with CUDA stub wrappers
- vllm/v1/worker/mps_worker.py: MPS worker with gloo distributed backend

Modified files:
- PlatformEnum.MPS added to interface.py with is_mps() method
- MPS platform plugin in __init__.py; CPU plugin updated to avoid mutual exclusion on macOS
- forward_mps() dispatch added to CustomOp
- MPS_ATTN registered in the attention backend registry
- "mps" added to the Device literal type

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
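The device-detection part of such a platform class can be sketched in pure PyTorch. This is an illustrative outline only, assuming `torch.backends.mps` for availability checks; the class and method names below are placeholders, not vLLM's actual `vllm/platforms/mps.py` API:

```python
import torch

class MpsPlatform:
    """Illustrative MPS platform stub (not vLLM's real class)."""
    device_name: str = "mps"
    device_type: str = "mps"

    @classmethod
    def is_available(cls) -> bool:
        # torch exposes Metal availability under torch.backends.mps;
        # guard with getattr for older torch builds that lack the module
        mps = getattr(torch.backends, "mps", None)
        return mps is not None and mps.is_available()

    @classmethod
    def get_device_name(cls, device_id: int = 0) -> str:
        # Apple exposes a single unified-memory GPU, so device_id is unused
        return "Apple Silicon GPU (MPS)"
```

On non-Apple hosts `is_available()` simply returns `False`, which is what lets the plugin resolution in `__init__.py` fall through to CPU.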
- test_llama_7b_bfloat16_generation: run Llama-7B inference with BF16 on MPS
- test_llama_7b_float16_generation: run Llama-7B inference with FP16 on MPS
- These tests validate real-world inference performance with Metal kernels
- Includes memory utilization and generation quality checks

These are the primary E2E validation tests for the vLLM MPS platform integration with Hub Metal kernels.

Co-developed-by: Claude Code v2.0.76 (claude-haiku-4-5-20251001)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
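A test of this shape might be structured as below. This is a hedged sketch, not the actual test file: the model name, sampling parameters, and skip condition are assumptions, and the real tests add the memory-utilization and quality checks described above:

```python
import pytest
import torch

# Skip everywhere except Apple Silicon hosts with a working MPS backend
requires_mps = pytest.mark.skipif(
    not (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()),
    reason="MPS backend not available",
)

@requires_mps
@pytest.mark.parametrize("dtype", ["bfloat16", "float16"])
def test_llama_7b_generation(dtype):
    # Imported lazily so collection works on hosts without MPS
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype=dtype,
              gpu_memory_utilization=0.7)
    params = SamplingParams(max_tokens=32, temperature=0.0)
    outputs = llm.generate(["The capital of France is"], params)
    text = outputs[0].outputs[0].text
    assert text.strip(), "generation produced no tokens"
```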
- benchmark_mps_vs_llamacpp.py: measure throughput, latency, memory usage
- Supports BF16, FP16, FP32 precision
- Configurable prompt/token count for flexible benchmarking
- Outputs metrics: tokens/sec, ms/token, peak GPU memory
- Includes instructions for running the equivalent llama.cpp benchmark

This enables quantitative E2E validation against the llama.cpp Metal backend.

Co-developed-by: Claude Code v2.0.76 (claude-haiku-4-5-20251001)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
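The core metric computation such a script needs is small; a minimal sketch (the function name and dict keys here are illustrative, not the script's actual interface):

```python
import time

def measure_generation(generate_fn, prompt: str, num_tokens: int) -> dict:
    """Time one generation call and derive tokens/sec and ms/token."""
    start = time.perf_counter()
    generate_fn(prompt, num_tokens)
    elapsed = time.perf_counter() - start
    return {
        "tokens_per_sec": num_tokens / elapsed,
        "ms_per_token": elapsed * 1000.0 / num_tokens,
        "elapsed_s": elapsed,
    }
```

Peak GPU memory on MPS would be read separately (PyTorch exposes allocator statistics for the MPS device via the `torch.mps` module), and the same prompt/token counts can then be replayed through llama.cpp's own benchmark tooling for an apples-to-apples comparison.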
Branch AWQ apply() and GPTQ process_weights_after_loading()/apply() on is_mps() to use dequant+matmul instead of CUDA-only fused kernels.

On MPS, GPTQ skips gptq_shuffle (the exllama reorder) and dequantizes from the original checkpoint layout. AWQ uses its native interleaved bit order directly. The mps_dequant.py wrapper tries to import the dequant_int4 Metal kernel package for GPU-accelerated dequant, falling back to pure PyTorch bitwise operations when the package isn't installed.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
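The pure PyTorch fallback amounts to unpacking eight 4-bit values from each int32 with shifts and masks, then applying per-group scale and zero-point. A simplified sketch; note the nibble order here is sequential, whereas AWQ's real checkpoint layout interleaves values in a different order, and the function name and parameter shapes are assumptions for illustration:

```python
import torch

def dequant_int4_fallback(packed: torch.Tensor, scales: torch.Tensor,
                          zeros: torch.Tensor, group_size: int) -> torch.Tensor:
    """Unpack 8 x 4-bit values per int32, then apply per-group scale/zero.

    packed: (in_features, cols) int32, 8 weights packed per element
    scales, zeros: (in_features // group_size, cols * 8)
    """
    shifts = torch.arange(0, 32, 4, device=packed.device)
    # (rows, cols, 8) integer values in [0, 15], one per nibble
    vals = (packed.unsqueeze(-1) >> shifts) & 0xF
    vals = vals.view(packed.shape[0], -1).to(scales.dtype)
    # every `group_size` rows share one scale/zero row
    g = torch.arange(vals.shape[0], device=packed.device) // group_size
    return (vals - zeros[g]) * scales[g]
```

The Metal kernel path performs the same unpack+affine transform on the GPU; the branch in mps_dequant.py only decides which implementation produces the dense tensor fed to the matmul.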
Add Metal kernel path for GGUF quantized models on MPS (Apple Metal). Implements dequant+matmul for Q4_0, Q8_0, and Q4_K types via the dequant_gguf kernel package, with a numpy-based fallback using the gguf Python library.

Changes:
- gguf.py: add MPS branch in _fused_mul_mat_gguf and _apply_gguf_embedding to route through gguf_dequant_on_mps instead of CUDA ops
- gguf.py: fix get_supported_act_dtypes and get_min_capability for MPS
- mps_dequant.py: add GGUF section with Metal kernel import, numpy fallback, and gguf_dequant_on_mps entry point

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
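For the simplest of the three types, Q8_0, the numpy fallback is short: each block stores one fp16 scale followed by 32 signed byte quants, and the dequantized value is scale times quant. A sketch of that fallback (function name is illustrative; the real code goes through the gguf Python library):

```python
import numpy as np

QK8_0 = 32                 # elements per Q8_0 block
BLOCK_BYTES = 2 + QK8_0    # fp16 scale + 32 int8 quants

def dequant_q8_0(raw: bytes, n_elements: int) -> np.ndarray:
    """Dequantize GGUF Q8_0 data: value = scale * int8 quant."""
    n_blocks = n_elements // QK8_0
    buf = np.frombuffer(raw, dtype=np.uint8).reshape(n_blocks, BLOCK_BYTES)
    # first 2 bytes of each block: little-endian fp16 scale -> (n_blocks, 1)
    scales = buf[:, :2].copy().view(np.float16).astype(np.float32)
    # remaining 32 bytes: signed quants -> (n_blocks, 32)
    quants = buf[:, 2:].copy().view(np.int8).astype(np.float32)
    return (scales * quants).reshape(-1)
```

Q4_0 adds a nibble unpack before the multiply, and Q4_K uses super-blocks with per-sub-block scales and mins, but the structure (parse block layout, broadcast scales, multiply) is the same.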
Add MPS as a GPU backend tab in the installation docs alongside CUDA, ROCm, and XPU. Covers requirements, build from source, optional Metal quantization kernels, usage examples, performance expectations, memory guidelines, and troubleshooting.

Update cpu.apple.inc.md to point to the new GPU/MPS docs instead of the external vllm-metal project.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
Force-pushed `fa9b5e4` to `6102f77`
Summary
- `forward_mps()` falling back to `forward_native()`
- `macos-15-xlarge` runner with MPS platform assertion

New files
- `vllm/platforms/mps.py`: MPS platform class
- `vllm/v1/attention/backends/mps_attn.py`: pure PyTorch attention with paged KV cache
- `vllm/v1/worker/mps_model_runner.py`: MPS model runner
- `vllm/v1/worker/mps_worker.py`: MPS worker

Modified files
- `vllm/platforms/interface.py`: `PlatformEnum.MPS`, `is_mps()`
- `vllm/platforms/__init__.py`: MPS plugin, CPU plugin mutual exclusion fix
- `vllm/model_executor/custom_op.py`: `forward_mps()` dispatch
- `vllm/v1/attention/backends/registry.py`: `MPS_ATTN` enum
- `vllm/config/device.py`: `"mps"` in Device literal
- `.github/workflows/macos-smoke-test.yml`: xlarge runner, PR trigger, MPS verification

Test plan
- `vllm serve` with dummy weights starts and responds on MPS
- `current_platform.is_mps() == True` on Apple Silicon
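The second test-plan item can be checked in one line against PyTorch's own view of the hardware. A sketch, where the expected value comes from `torch.backends.mps` and the `vllm.platforms.current_platform` import is only valid with this PR applied:

```python
import torch

def mps_expected() -> bool:
    """True iff PyTorch reports a usable Metal backend on this host."""
    return hasattr(torch.backends, "mps") and torch.backends.mps.is_available()

# On an Apple Silicon machine with this branch installed, the platform
# resolution should agree with PyTorch:
#   from vllm.platforms import current_platform
#   assert current_platform.is_mps() == mps_expected()
```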