Conversation
|
It seems that my crash (mat2 is on cuda:1, different from other tensors on cuda:0) was caused by mixed devices during the rank-1 subtraction inside abliterate. The code already tries to move r to matrix.device, but two issues remain:
1. It doesn't ensure r has the same dtype as matrix, so .matmul may create intermediate tensors on a different device with certain backends or sharded tensors.
2. It assumes matrix is 2-D, so torch.outer will always work; but some implementations (MoE variants like gpt-oss) store expert weights in a single 3-D tensor (E, d, k). In that case r^T W is 2-D (E, k) and torch.outer fails or yields the wrong shape.
So I added an extra "keep everything on one device for the calculation" step, as sketched below.
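For illustration, here is a minimal sketch of that rank-1 ablation, W <- W - r (r^T W); the function name abliterate_weight and the exact shapes are assumptions, not the project's actual code. It moves r onto the weight's device and dtype before any matmul, uses torch.outer for a 2-D weight, and switches to einsum for a 3-D expert tensor of shape (E, d, k):

```python
import torch

def abliterate_weight(matrix: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: remove the refusal direction r from a weight.

    matrix is (d, k) or, for MoE-style layouts, (E, d, k); r is assumed to
    be a unit direction of shape (d,). Illustration only, not the project's
    actual implementation.
    """
    # Keep everything on the weight's device and dtype so the matmul never
    # mixes cuda:0 / cuda:1 tensors or fp16 / fp32 intermediates.
    r = r.to(device=matrix.device, dtype=matrix.dtype)

    if matrix.dim() == 2:
        # r @ matrix has shape (k,); torch.outer builds the (d, k) rank-1 update.
        proj = torch.outer(r, r @ matrix)
    elif matrix.dim() == 3:
        # Expert weights stored as (E, d, k): r^T W_e is (E, k), so build the
        # per-expert rank-1 update with einsum instead of torch.outer.
        rtw = torch.einsum("d,edk->ek", r, matrix)
        proj = torch.einsum("d,ek->edk", r, rtw)
    else:
        raise ValueError(f"unexpected weight rank: {matrix.dim()}")

    return matrix - proj
```

The einsum pair in the 3-D branch is just the batched equivalent of torch.outer(r, r @ W) applied per expert.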
|
Thanks for pointing out the problem with …
|
I have the same issue (multiple 5090s). I can confirm that on my setup, removing #46 allowed multiple GPUs to operate again.
|
Confirmed, it doesn't crash after removing #46.
|
Thank you @JoshTickles and @teezeerc. I'm keeping this PR open as a reminder that #46 should be re-merged in a fixed form. |
|
Resolved by #60. |
Keeps tensors on the same device, preventing:
RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cuda:1, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)