Force-pushed `fd871bc` to `fa9b5e4`
Add a minimal viable MPS platform so vLLM can detect and use Apple Silicon GPUs via the Metal Performance Shaders backend. This enables model loading and inference on macOS without CUDA.

New files:
- vllm/platforms/mps.py: MPS platform class (device detection, memory APIs, config validation)
- vllm/v1/attention/backends/mps_attn.py: pure PyTorch attention with paged KV cache (no C++ extensions needed)
- vllm/v1/worker/mps_model_runner.py: MPS model runner extending GPUModelRunner with CUDA stub wrappers
- vllm/v1/worker/mps_worker.py: MPS worker with gloo distributed backend

Modified files:
- PlatformEnum.MPS added to interface.py with is_mps() method
- MPS platform plugin in __init__.py; CPU plugin updated to avoid mutual exclusion on macOS
- forward_mps() dispatch added to CustomOp
- MPS_ATTN registered in the attention backend registry
- "mps" added to the Device literal type

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
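The device-detection part of such a platform class can be sketched in pure PyTorch. This is an illustrative outline only, assuming `torch.backends.mps` for availability checks; the class and method names below are placeholders, not vLLM's actual `vllm/platforms/mps.py` API:

```python
import torch

class MpsPlatform:
    """Illustrative MPS platform stub (not vLLM's real class)."""
    device_name: str = "mps"
    device_type: str = "mps"

    @classmethod
    def is_available(cls) -> bool:
        # torch exposes Metal availability under torch.backends.mps;
        # guard with getattr for older torch builds that lack the module
        mps = getattr(torch.backends, "mps", None)
        return mps is not None and mps.is_available()

    @classmethod
    def get_device_name(cls, device_id: int = 0) -> str:
        # Apple exposes a single unified-memory GPU, so device_id is unused
        return "Apple Silicon GPU (MPS)"
```

On non-Apple hosts `is_available()` simply returns `False`, which is what lets the plugin resolution in `__init__.py` fall through to CPU.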
- test_llama_7b_bfloat16_generation: run Llama-7B inference with BF16 on MPS
- test_llama_7b_float16_generation: run Llama-7B inference with FP16 on MPS
- These tests validate real-world inference performance with Metal kernels
- Includes memory utilization and generation quality checks

These are the primary E2E validation tests for the vLLM MPS platform integration with Hub Metal kernels.

Co-developed-by: Claude Code v2.0.76 (claude-haiku-4-5-20251001)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
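A test of this shape might be structured as below. This is a hedged sketch, not the actual test file: the model name, sampling parameters, and skip condition are assumptions, and the real tests add the memory-utilization and quality checks described above:

```python
import pytest
import torch

# Skip everywhere except Apple Silicon hosts with a working MPS backend
requires_mps = pytest.mark.skipif(
    not (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()),
    reason="MPS backend not available",
)

@requires_mps
@pytest.mark.parametrize("dtype", ["bfloat16", "float16"])
def test_llama_7b_generation(dtype):
    # Imported lazily so collection works on hosts without MPS
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype=dtype,
              gpu_memory_utilization=0.7)
    params = SamplingParams(max_tokens=32, temperature=0.0)
    outputs = llm.generate(["The capital of France is"], params)
    text = outputs[0].outputs[0].text
    assert text.strip(), "generation produced no tokens"
```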
- benchmark_mps_vs_llamacpp.py: measure throughput, latency, memory usage
- Supports BF16, FP16, FP32 precision
- Configurable prompt/token count for flexible benchmarking
- Outputs metrics: tokens/sec, ms/token, peak GPU memory
- Includes instructions for running the equivalent llama.cpp benchmark

This enables quantitative E2E validation against the llama.cpp Metal backend.

Co-developed-by: Claude Code v2.0.76 (claude-haiku-4-5-20251001)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
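The core metric computation such a script needs is small; a minimal sketch (the function name and dict keys here are illustrative, not the script's actual interface):

```python
import time

def measure_generation(generate_fn, prompt: str, num_tokens: int) -> dict:
    """Time one generation call and derive tokens/sec and ms/token."""
    start = time.perf_counter()
    generate_fn(prompt, num_tokens)
    elapsed = time.perf_counter() - start
    return {
        "tokens_per_sec": num_tokens / elapsed,
        "ms_per_token": elapsed * 1000.0 / num_tokens,
        "elapsed_s": elapsed,
    }
```

Peak GPU memory on MPS would be read separately (PyTorch exposes allocator statistics for the MPS device via the `torch.mps` module), and the same prompt/token counts can then be replayed through llama.cpp's own benchmark tooling for an apples-to-apples comparison.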
Branch AWQ apply() and GPTQ process_weights_after_loading()/apply() on is_mps() to use dequant+matmul instead of CUDA-only fused kernels.

On MPS, GPTQ skips gptq_shuffle (the exllama reorder) and dequantizes from the original checkpoint layout. AWQ uses its native interleaved bit order directly. The mps_dequant.py wrapper tries to import the dequant_int4 Metal kernel package for GPU-accelerated dequant, falling back to pure PyTorch bitwise operations when the package isn't installed.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
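The pure PyTorch fallback amounts to unpacking eight 4-bit values from each int32 with shifts and masks, then applying per-group scale and zero-point. A simplified sketch; note the nibble order here is sequential, whereas AWQ's real checkpoint layout interleaves values in a different order, and the function name and parameter shapes are assumptions for illustration:

```python
import torch

def dequant_int4_fallback(packed: torch.Tensor, scales: torch.Tensor,
                          zeros: torch.Tensor, group_size: int) -> torch.Tensor:
    """Unpack 8 x 4-bit values per int32, then apply per-group scale/zero.

    packed: (in_features, cols) int32, 8 weights packed per element
    scales, zeros: (in_features // group_size, cols * 8)
    """
    shifts = torch.arange(0, 32, 4, device=packed.device)
    # (rows, cols, 8) integer values in [0, 15], one per nibble
    vals = (packed.unsqueeze(-1) >> shifts) & 0xF
    vals = vals.view(packed.shape[0], -1).to(scales.dtype)
    # every `group_size` rows share one scale/zero row
    g = torch.arange(vals.shape[0], device=packed.device) // group_size
    return (vals - zeros[g]) * scales[g]
```

The Metal kernel path performs the same unpack+affine transform on the GPU; the branch in mps_dequant.py only decides which implementation produces the dense tensor fed to the matmul.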
Add Metal kernel path for GGUF quantized models on MPS (Apple Metal). Implements dequant+matmul for Q4_0, Q8_0, and Q4_K types via the dequant_gguf kernel package, with a numpy-based fallback using the gguf Python library.

Changes:
- gguf.py: add MPS branch in _fused_mul_mat_gguf and _apply_gguf_embedding to route through gguf_dequant_on_mps instead of CUDA ops
- gguf.py: fix get_supported_act_dtypes and get_min_capability for MPS
- mps_dequant.py: add GGUF section with Metal kernel import, numpy fallback, and gguf_dequant_on_mps entry point

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
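For the simplest of the three types, Q8_0, the numpy fallback is short: each block stores one fp16 scale followed by 32 signed byte quants, and the dequantized value is scale times quant. A sketch of that fallback (function name is illustrative; the real code goes through the gguf Python library):

```python
import numpy as np

QK8_0 = 32                 # elements per Q8_0 block
BLOCK_BYTES = 2 + QK8_0    # fp16 scale + 32 int8 quants

def dequant_q8_0(raw: bytes, n_elements: int) -> np.ndarray:
    """Dequantize GGUF Q8_0 data: value = scale * int8 quant."""
    n_blocks = n_elements // QK8_0
    buf = np.frombuffer(raw, dtype=np.uint8).reshape(n_blocks, BLOCK_BYTES)
    # first 2 bytes of each block: little-endian fp16 scale -> (n_blocks, 1)
    scales = buf[:, :2].copy().view(np.float16).astype(np.float32)
    # remaining 32 bytes: signed quants -> (n_blocks, 32)
    quants = buf[:, 2:].copy().view(np.int8).astype(np.float32)
    return (scales * quants).reshape(-1)
```

Q4_0 adds a nibble unpack before the multiply, and Q4_K uses super-blocks with per-sub-block scales and mins, but the structure (parse block layout, broadcast scales, multiply) is the same.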
Add MPS as a GPU backend tab in the installation docs alongside CUDA, ROCm, and XPU. Covers requirements, build from source, optional Metal quantization kernels, usage examples, performance expectations, memory guidelines, and troubleshooting.

Update cpu.apple.inc.md to point to the new GPU/MPS docs instead of the external vllm-metal project.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
Force-pushed `fa9b5e4` to `6102f77`
Summary
- `forward_mps()` falling back to `forward_native()`
- `macos-15-xlarge` runner with MPS platform assertion

New files
- `vllm/platforms/mps.py`: MPS platform class
- `vllm/v1/attention/backends/mps_attn.py`: pure PyTorch attention with paged KV cache
- `vllm/v1/worker/mps_model_runner.py`: MPS model runner
- `vllm/v1/worker/mps_worker.py`: MPS worker

Modified files
- `vllm/platforms/interface.py`: `PlatformEnum.MPS`, `is_mps()`
- `vllm/platforms/__init__.py`: MPS plugin, CPU plugin mutual exclusion fix
- `vllm/model_executor/custom_op.py`: `forward_mps()` dispatch
- `vllm/v1/attention/backends/registry.py`: `MPS_ATTN` enum
- `vllm/config/device.py`: `"mps"` in Device literal
- `.github/workflows/macos-smoke-test.yml`: xlarge runner, PR trigger, MPS verification

Test plan
- `vllm serve` with dummy weights starts and responds on MPS
- `current_platform.is_mps() == True` on Apple Silicon
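The second test-plan item can be checked in one line against PyTorch's own view of the hardware. A sketch, where the expected value comes from `torch.backends.mps` and the `vllm.platforms.current_platform` import is only valid with this PR applied:

```python
import torch

def mps_expected() -> bool:
    """True iff PyTorch reports a usable Metal backend on this host."""
    return hasattr(torch.backends, "mps") and torch.backends.mps.is_available()

# On an Apple Silicon machine with this branch installed, the platform
# resolution should agree with PyTorch:
#   from vllm.platforms import current_platform
#   assert current_platform.is_mps() == mps_expected()
```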