Algorithm loops forever for n=1 (single GPU)

Hi author,

Thank you for the great work. The algorithm runs very fast! 
However, I think the current algorithm does not handle the corner case of a single GPU (n=1). In that case, the while loop in the allocate function runs forever.

Is there an easy way to fix this?
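
In case it helps, here is a minimal sketch of the kind of guard I have in mind. It is only an illustration: `allocate_with_guard`, its parameters, and the "list of per-GPU lists" return shape are my own assumptions, not the project's actual API. The idea is to return early when n=1 so the while loop is never entered.

```python
from typing import Callable, List, Sequence


def allocate_with_guard(
    items: Sequence[int],
    n: int,
    allocate: Callable[[Sequence[int], int], List[List[int]]],
) -> List[List[int]]:
    """Hypothetical wrapper; `allocate` stands in for the real function."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        # Single GPU: everything goes to the only device, so no search
        # is needed. (Assumed return shape: one list per GPU.)
        return [list(items)]
    # n >= 2: defer to the existing algorithm unchanged.
    return allocate(items, n)
```

With something like this, a call with n=1 returns immediately, and all other cases behave exactly as before.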

Thank you!