
perf: Multigpu fixes #72

Closed

teezeerc wants to merge 3 commits into p-e-w:master from teezeerc:multigpu

Conversation

@teezeerc commented Dec 5, 2025

Keeps tensors on the same device, preventing:

RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cuda:1, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)
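For context, a minimal sketch (hypothetical shapes, not the project's code) of how this mismatch arises once a model is sharded across two GPUs:

```python
import torch

# Hypothetical repro: with a sharded model, one layer's weight can live
# on cuda:1 while the direction vector was computed on cuda:0.
w = torch.randn(8, 8, device="cuda:1")  # weight on the second GPU
r = torch.randn(8, device="cuda:0")     # direction on the first GPU

w @ r  # RuntimeError: Expected all tensors to be on the same device ...
```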

@p-e-w (Owner) commented Dec 5, 2025

Thanks for the PR!

This was supposedly already fixed in #17, and several people, including the author of that PR, have had success on multi-GPU setups. Can you explain what #17 missed so I can understand this better (I don't have multiple GPUs myself)?

@teezeerc (Author) commented Dec 6, 2025

It seems that my crash (mat2 is on cuda:1, different from other tensors on cuda:0) was caused by mixed devices during the rank-1 subtraction inside abliterate. The existing code already tries to move r to matrix.device, but two issues remain:

1. It doesn't ensure r has the same dtype as matrix, so .matmul may create intermediate tensors on a different device with certain backends / sharded tensors, and

2. It assumes matrix is 2-D, so that torch.outer always works. Some implementations (MoE variants like gpt-oss) instead store the expert weights in a single 3-D tensor (E, d, k); in that case r^T W is 2-D with shape (E, k), and torch.outer fails or yields wrong shapes.

So I added an additional "keep everything on one device for the calculation" step (see the sketch below).
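For illustration, a minimal sketch of that step, assuming the ablation is the standard rank-1 update W - r (r^T W) with a normalized direction r; the ablate name and exact formulation here are illustrative, not the PR's literal diff:

```python
import torch

def ablate(matrix: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    # Issue 1: match dtype as well as device, so no intermediate
    # tensor ends up on a different device or dtype.
    r = r.to(device=matrix.device, dtype=matrix.dtype)
    # proj = r^T W. For a 2-D (d, k) weight this is (k,); for a stacked
    # MoE weight (E, d, k), matmul batches over E and yields (E, k).
    proj = torch.matmul(r, matrix)
    # Issue 2: a generalized outer product instead of torch.outer,
    # which only accepts 1-D inputs: (d,) x (..., k) -> (..., d, k).
    return matrix - torch.einsum("d,...k->...dk", r, proj)
```

With that, both a dense (d, k) weight and a gpt-oss-style (E, d, k) expert stack stay on their own device and dtype throughout the calculation.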

@p-e-w (Owner) commented Dec 7, 2025

Thanks for pointing out the problem with torch.outer! I have reverted #46 for now, which introduced that outer product. Please check whether this fixes the problem you're seeing.

@JoshTickles commented

I have the same issue (I have multiple 5090s).

I can confirm that for my setup, removal of #46 allowed multiple GPUs to operate again.

@teezeerc (Author) commented Dec 9, 2025

Confirmed: it doesn't crash after removing #46.

@p-e-w (Owner) commented Dec 9, 2025

Thank you @JoshTickles and @teezeerc. I'm keeping this PR open as a reminder that #46 should be re-merged in a fixed form.

@p-e-w (Owner) commented Dec 16, 2025

Resolved by #60.

@p-e-w closed this Dec 16, 2025