Let's start with GPU acceleration using a PyTorch index. Dot-product and cosine-similarity search both reduce to essentially a matrix multiplication, so hardware accelerators are a natural fit. With 32 GiB of VRAM, we could fit roughly 22 million MiniLM embeddings (384 dimensions at f32 precision, so about 1.5 KiB per vector) on a single GPU.
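Here is a minimal sketch of what such a brute-force PyTorch index could look like. The class name `TorchIndex` and its interface are illustrative assumptions, not an existing library API: embeddings are L2-normalized up front so that a single matrix multiplication yields cosine similarities, and `torch.topk` pulls out the best matches per query.

```python
import torch

class TorchIndex:
    """Brute-force cosine-similarity index held entirely in GPU memory (sketch)."""

    def __init__(self, embeddings: torch.Tensor, device: str = "cuda"):
        # embeddings: (num_vectors, 384) float32 MiniLM vectors.
        # Normalize once so the dot product below equals cosine similarity.
        self.vectors = torch.nn.functional.normalize(embeddings, dim=-1).to(device)

    def search(self, queries: torch.Tensor, k: int = 10):
        # queries: (num_queries, 384). Normalize them the same way.
        q = torch.nn.functional.normalize(queries, dim=-1).to(self.vectors.device)
        scores = q @ self.vectors.T            # (num_queries, num_vectors) similarity matrix
        return torch.topk(scores, k, dim=-1)   # top-k scores and indices per query
```

The whole search is one matmul plus a top-k, which is exactly the kind of workload GPUs are built for.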