Conversation
|
It seems that my crash (mat2 is on cuda:1, different from other tensors on cuda:0) was caused by mixed devices during the rank-1 subtraction inside abliterate. The code already tries to move r to matrix.device, but two issues remain:
1. It doesn't ensure r has the same dtype as matrix, so .matmul may create intermediate tensors on a different device with certain backends or sharded tensors.
2. It assumes matrix is 2-D, so torch.outer will always work; but some implementations (MoE variants like gpt-oss) store expert weights in a single 3-D tensor (E, d, k). In that case r^T W is 2-D (E, k) and torch.outer fails or yields the wrong shape.
So I added an extra "keep everything on one device for the calculation" step, as sketched below.
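For illustration, here is a minimal sketch of that rank-1 ablation, W <- W - r (r^T W); the function name abliterate_weight and the exact shapes are assumptions, not the project's actual code. It moves r onto the weight's device and dtype before any matmul, uses torch.outer for a 2-D weight, and switches to einsum for a 3-D expert tensor of shape (E, d, k):

```python
import torch

def abliterate_weight(matrix: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: remove the refusal direction r from a weight.

    matrix is (d, k) or, for MoE-style layouts, (E, d, k); r is assumed to
    be a unit direction of shape (d,). Illustration only, not the project's
    actual implementation.
    """
    # Keep everything on the weight's device and dtype so the matmul never
    # mixes cuda:0 / cuda:1 tensors or fp16 / fp32 intermediates.
    r = r.to(device=matrix.device, dtype=matrix.dtype)

    if matrix.dim() == 2:
        # r @ matrix has shape (k,); torch.outer builds the (d, k) rank-1 update.
        proj = torch.outer(r, r @ matrix)
    elif matrix.dim() == 3:
        # Expert weights stored as (E, d, k): r^T W_e is (E, k), so build the
        # per-expert rank-1 update with einsum instead of torch.outer.
        rtw = torch.einsum("d,edk->ek", r, matrix)
        proj = torch.einsum("d,ek->edk", r, rtw)
    else:
        raise ValueError(f"unexpected weight rank: {matrix.dim()}")

    return matrix - proj
```

The einsum pair in the 3-D branch is just the batched equivalent of torch.outer(r, r @ W) applied per expert.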
|
Thanks for pointing out the problem with …
|
I have the same issue (multiple 5090s). I can confirm that on my setup, removing #46 allowed multiple GPUs to operate again.
|
Confirmed, it doesn't crash after removing #46.
|
Thank you @JoshTickles and @teezeerc. I'm keeping this PR open as a reminder that #46 should be re-merged in a fixed form. |
|
Resolved by #60. |
Keeps tensors on the same device, preventing:
RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cuda:1, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)