Hi author,
Thank you for the great work. The algorithm runs very fast!
However, I think the current algorithm does not handle the corner case of a single GPU (n=1): in this case, the while loop in the allocate function runs forever.
Is there a way to easily fix the problem?
Thank you!