feat: add GPU backends, quantization, and search optimizations #166
cluster2600 wants to merge 6 commits into alibaba:main
Add Metal Shading Language kernels for GPU-accelerated vector operations on Apple Silicon, including L2 distance, inner product, cosine similarity, vector normalization, matrix multiplication, and top-k selection. Includes C API wrapper, CMakeLists.txt for Metal compilation, and comprehensive Google Test suite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
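For reviewers who want a CPU baseline to compare kernel output against, here is a minimal pure-Python sketch of the same operations (L2, inner product, cosine, normalization, top-k). It is not the Metal code from this commit; the function names are hypothetical and chosen only for illustration.

```python
import heapq
import math

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(inner_product(v, v))
    return list(v) if norm == 0.0 else [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity computed as the inner product of normalized vectors."""
    return inner_product(normalize(a), normalize(b))

def top_k_l2(query, vectors, k):
    """Return the k (squared distance, index) pairs closest to query, nearest first."""
    scored = ((l2_squared(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)
```

A GPU test could run the same inputs through both paths and compare results within a floating-point tolerance.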
Add unified acceleration module supporting FAISS CPU and GPU backends with automatic hardware detection. Includes backend benchmark suite for performance comparison and realistic dataset benchmarks.

New files:
- python/zvec/accelerate.py: Unified accelerator interface
- python/zvec/backends/gpu.py: FAISS GPU backend
- python/zvec/backends/detect.py: Hardware detection
- python/zvec/backends/benchmark.py: Performance benchmarks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
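As a rough illustration of backend selection with fallback, the sketch below picks the first importable package from a preference list. It does not reproduce detect.py: `module_available` and `pick_backend` are hypothetical names, and this version only checks whether a package is installed. Real hardware detection would also need to query the device itself.

```python
import importlib.util

def module_available(name):
    """True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None

def pick_backend(preferred=("faiss", "torch")):
    """Return the first installed backend from `preferred`, or "cpu" as a fallback."""
    for name in preferred:
        if module_available(name):
            return name
    return "cpu"
```

Checking with `find_spec` avoids paying the import cost of a heavy GPU library just to learn that it exists.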
Add Product Quantization (PQ) encoder, Optimized Product Quantization (OPQ) with rotation learning, and Scalar Quantization (8/16-bit) for efficient vector compression and approximate nearest neighbor search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
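To make the 8-bit scalar quantization idea concrete, here is a minimal sketch that maps floats onto one byte per value using a single global range. The function names and the choice of a global min/max range are assumptions for illustration, not the API added in this commit.

```python
def sq8_train(vectors):
    """Learn a global (lo, scale) pair that maps the observed value range onto 0..255."""
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 when all values are identical
    return lo, scale

def sq8_encode(v, lo, scale):
    """Quantize each value to the nearest of 256 levels; out-of-range values are clamped."""
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in v)

def sq8_decode(code, lo, scale):
    """Map byte codes back to approximate float values."""
    return [lo + c * scale for c in code]
```

For values inside the training range, the reconstruction error per component is at most half a quantization step (`scale / 2`), and storage drops from 4 bytes to 1 byte per float32 component.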
Add pure Python HNSW index with FAISS fallback, optimized search functions (ADC, batched search, reranking), and Apple Silicon MPS backend using PyTorch for GPU-accelerated vector operations on macOS. Update pyproject.toml with accelerate/gpu optional dependencies and per-file-ignores for backends.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
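The sketch below illustrates asymmetric distance computation (ADC) followed by exact reranking, under the assumption that PQ codebooks are already trained. All names and the toy data are hypothetical; this is not the code in search.py.

```python
def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adc_tables(query, codebooks):
    """One table per sub-quantizer: squared distance from the query slice to each centroid."""
    sub_dim = len(query) // len(codebooks)
    return [
        [l2_squared(query[j * sub_dim:(j + 1) * sub_dim], c) for c in book]
        for j, book in enumerate(codebooks)
    ]

def adc_distance(code, tables):
    """Approximate squared distance: one table lookup per sub-quantizer, summed."""
    return sum(tables[j][c] for j, c in enumerate(code))

def adc_search(query, codes, codebooks, k):
    """Return the k (approximate distance, index) pairs with the smallest ADC distance."""
    tables = adc_tables(query, codebooks)
    return sorted((adc_distance(code, tables), i) for i, code in enumerate(codes))[:k]

def rerank(query, candidates, raw_vectors):
    """Re-score ADC candidates with exact distances on the uncompressed vectors."""
    return sorted((l2_squared(query, raw_vectors[i]), i) for _, i in candidates)

# Hypothetical toy setup: m = 2 sub-quantizers, 2 centroids each, sub-dimension 1.
codebooks = [[[0.0], [1.0]], [[0.0], [2.0]]]
raw_vectors = [[0.0, 0.0], [1.0, 2.9], [1.0, 0.95]]
codes = [(0, 0), (1, 1), (1, 0)]  # nearest-centroid codes of raw_vectors
query = [1.0, 1.9]

candidates = adc_search(query, codes, codebooks, k=2)
reranked = rerank(query, candidates, raw_vectors)
```

The query stays uncompressed while the database vectors are codes, which is what makes the distance "asymmetric". In the toy data the reranking step reverses the order of the two candidates, because quantization error made vector 1 look closer than it really is.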
Add header-only C++ implementations of Product Quantization (PQ) and Optimized Product Quantization (OPQ), plus upgrade the Python OPQ rotation from QR decomposition to SVD-based Orthogonal Procrustes.

C++ Product Quantizer (product_quantizer.h):
- k-means training with configurable m sub-quantizers and k centroids
- encode/decode with distortion measurement
- Header-only, depends only on <algorithm>, <cmath>, <vector>

C++ OPQ (opq.h):
- SVD-based Procrustes rotation: R = V * U^T from SVD(X^T * Y)
- Self-contained one-sided Jacobi SVD solver (no LAPACK dependency)
- Iterative refinement of rotation + PQ codebooks

Python OPQ (_learn_rotation):
- Replace simplified QR decomposition with SVD Procrustes
- M = X^T @ decoded, U, _, Vt = svd(M), R = Vt.T @ U.T
- Produces orthogonal rotations (error ~4e-6)
- Benchmarked: ~1-10% reconstruction improvement over plain PQ

Follow-up to alibaba#166 ("Future Work: sophisticated OPQ optimization").

Tested on:
- macOS: clang++ C++17 compilation + runtime tests
- Linux (Blackwell GPU): Python OPQ + cuVS CAGRA integration

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
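For readers checking the rotation formula, the following derivation sketches why R = V * U^T solves the Procrustes step. It rests on a convention the commit message does not spell out, so treat it as an assumption: rows of X are training vectors, rows of Y are their PQ reconstructions ("decoded"), and each vector is rotated as x -> R x, so the rotated data matrix is X R^T.

```latex
% Assumed convention: rows of X are vectors, rotation applied as x -> R x.
\begin{align*}
R^\star &= \arg\min_{R^\top R = I} \lVert X R^\top - Y \rVert_F^2 \\
\lVert X R^\top - Y \rVert_F^2
  &= \lVert X \rVert_F^2 + \lVert Y \rVert_F^2 - 2\,\operatorname{tr}\!\left(R\, X^\top Y\right) \\
M &= X^\top Y = U \Sigma V^\top \\
\operatorname{tr}(R M) &= \operatorname{tr}\!\left(\Sigma\, V^\top R\, U\right) \le \operatorname{tr}(\Sigma),
  \quad \text{with equality when } V^\top R\, U = I \\
R^\star &= V U^\top
\end{align*}
```

Under the opposite convention, where the rotated data is X R, the same argument gives the transpose, R = U V^T, so it is worth confirming which convention the encoder uses when applying the learned rotation.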
Add persistent vector storage backed by RocksDB for GPU pipeline integration, plus documentation for the Metal C++ backend.

VectorStorage (vector_storage.h):
- RocksDB column families: "vectors", "pq_codes", "metadata"
- Batch put/get for raw vectors and PQ codes
- load_all() streams vectors into contiguous GPU-ready float buffer
- Integrates with existing RocksdbContext wrapper

Documentation (docs/METAL_CPP.md):
- Architecture overview: RocksDB → load_all() → Metal GPU Buffers
- Complete kernel reference table (distance, utility kernels)
- Simdgroup optimization dispatch model
- C++ PQ/OPQ API examples
- RocksDB storage API examples

Follow-up to alibaba#166 ("Future Work: Integration with RocksDB storage").

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
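The idea behind load_all() can be sketched in a few lines of Python: gather per-key vectors into one contiguous float buffer with a known row layout. This is only an illustration of the layout, not the C++ implementation; a plain dict stands in for the "vectors" column family, and the signature shown here is an assumption.

```python
from array import array

def load_all(store, dim):
    """Concatenate every stored vector into one contiguous float buffer.

    `store` is a plain dict standing in for the "vectors" column family.
    Returns (ids, buffer); vector i occupies buffer[i * dim:(i + 1) * dim].
    """
    ids = sorted(store)    # fixed order so ids and buffer rows line up
    buffer = array("f")    # typecode "f" is a C float (4 bytes on common platforms)
    for key in ids:
        vec = store[key]
        if len(vec) != dim:
            raise ValueError(f"vector {key!r} has dimension {len(vec)}, expected {dim}")
        buffer.extend(vec)
    return ids, buffer
```

A row-major buffer like this can be handed to a GPU API in a single copy, and returning the ids alongside it lets search results (row indices) be mapped back to storage keys.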
Hi there, thank you for the effort! We found that this PR covers a few different areas (C++ and Python backends) that might be better handled separately. Would you be open to splitting them up after proper discussion? To save you some time and effort, it’s usually best to open an issue, discuss the approach and reach a consensus before you start (vibe) coding. We'd love to hear your thoughts there! Thanks.
Hi @egolearner, thanks for the feedback! You're absolutely right — #166 was too broad. I've already split it into focused, independent PRs:
I'll open separate issues for each to discuss the approach before asking for review. Thanks for guiding the process!
Summary
Changes
C++ Metal Backend
- src/ailego/gpu/metal/zvec_metal.h: C API header
- src/ailego/gpu/metal/zvec_metal.cc: Objective-C++ implementation
- src/ailego/gpu/metal/zvec_metal.metal: Metal shaders (L2, IP, cosine, normalize, matmul, top-k)
- src/ailego/gpu/metal/CMakeLists.txt: Metal compilation
- tests/test_metal.cc: Google Test suite

Python Backends
- python/zvec/accelerate.py: Unified accelerator interface
- python/zvec/backends/gpu.py: FAISS GPU backend
- python/zvec/backends/detect.py: Hardware detection
- python/zvec/backends/quantization.py: PQ encoder
- python/zvec/backends/opq.py: OPQ encoder + Scalar Quantizer
- python/zvec/backends/hnsw.py: Pure Python HNSW with FAISS fallback
- python/zvec/backends/search.py: ADC, batch search, reranking
- python/zvec/backends/apple_silicon.py: Apple Silicon MPS backend
- python/zvec/backends/benchmark.py: Backend performance benchmarks

Configuration
- pyproject.toml: accelerate/gpu optional dependencies, per-file-ignores for backends

Docs
- docs/METAL_CPP.md: Metal backend documentation

Context
Split from #157. Aligns with cluster2600#2 content.
Test plan