
Implement Magnitude-Preserving Orthogonal Ablation #52

Merged
p-e-w merged 8 commits into p-e-w:master from spikymoth:implement-mpoa
Feb 2, 2026

Conversation

@spikymoth
Contributor

That is, the techniques from https://huggingface.co/blog/grimjim/projected-abliteration and https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration, with the updated name from https://huggingface.co/posts/grimjim/803126534676334 :)

Marking this as a draft mostly because I don't know the timeline for #43 - either this will need to be rebased on that, or that on this. But the code changes here are relatively small, so I think either direction should be simple enough.

A few notes:

  1. The magnitude preservation and orthogonal projection parts can be applied as independent refinements, so I added 2 options for them. According to the 2nd article, magnitude preservation is needed for the biprojection to be better than regular projection.
  2. The heuristics applied in the article are replaced by Heretic's random trials.
  3. I did not add magnitude sparsification, which is more of a pre-filtering step.
  4. Heretic randomly chooses one of two strategies for each trial: 1. A refusal direction from a random point between 2 layers, 2. Per-layer refusal directions. The algorithm described in the article is more similar to the former: A refusal direction is chosen from a particular source layer, and then it is projected onto a target layer's harmless direction (this is the biprojection, as far as I can tell).
  5. In the article, 2 layers are chosen based on their heuristics. In Heretic's random trials, only one layer is chosen - so you'd probably need 2 passes to match the original.

Additional caveats:

  1. I'm not a mathematician, nor do I have particularly deep knowledge of matrix math (it doesn't really come up in my day job), so this needs checking to see whether I interpreted things correctly.
  2. I've only lightly tested this to make sure results don't look broken (refusals go down, KL divergence looks reasonable); I haven't done verification of any metrics. Still running tests locally (another reason for the draft status), but advice on what to check would be welcome.
  3. From reading discussion around the method, I know there's some question about the applicability of KL divergence as a metric, since bigger changes may be "good". I've made no attempt to address that here.
  4. For a 2nd pass to work, you probably want to ignore parts of the training set that no longer result in refusal. I have some code for that (which seems to work quite well regardless of this change), but I'll make a separate PR for it.

@spikymoth spikymoth marked this pull request as draft November 26, 2025 09:22
@p-e-w
Owner

p-e-w commented Nov 26, 2025

As you already hinted at, this isn't really compatible with Heretic's KL divergence score.

The biprojection technique is supposed to make models smarter by changing model outputs even for "harmless" prompts (increasing the KL divergence).

Thus the parameters that Heretic's optimizer converges to wouldn't be the ones that actually result in the claimed extra benefits from biprojection. If KL divergence is the correct metric, it doesn't really matter how you abliterate unless one abliteration technique results in lower KL divergences for a given refusal count.

@spikymoth
Contributor Author

Yeah, I guess that while the refusal count is still an objective metric that makes sense (and is crucial for deciding which trial results to keep), the focus on KL divergence in determining the Pareto front of optimal solutions risks discarding good solutions that don't align with the metric (and currently there's no other way to distinguish them aside from chatting with the model).

I suppose it still acts as a somewhat decent sanity check for whether a change specifically targets things that are considered harmful, or completely breaks the model's ability to respond. But it won't be able to measure whether responses to safe or borderline requests become better or worse.

@spikymoth
Contributor Author

Trying it myself, so far I've noticed that the results seem more constrained with this method, with no results that show both a high KL divergence and a near-zero number of refusals. In terms of KL divergence, the "bang for buck" seems about the same or slightly worse. The constrained nature of the results does mean that at least 2 passes seem necessary to approach zero refusals.

One thing I'm worried about is that the "per layer" direction scope might be outperforming the "global" direction scope in terms of KL divergence, whereas the global scope is closer to the MPOA algorithm. It might be a good idea to make the direction scope configurable (with values "global", "per layer" and "both", defaulting to "both"), and if both scopes are allowed, display the type in the results.

I pushed a commit to do this to my development branch: https://github.com/spikymoth/heretic/tree/development
I can add it to this pull request if it seems like a good idea, or save it for later.

@spikymoth
Contributor Author

spikymoth commented Nov 27, 2025

One thing I'm worried about is that the "per layer" direction scope might be outperforming the "global" direction scope in terms of KL divergence

Interestingly, this was completely wrong. In fact, it's the "global" results that dominate - after running 200 trials, there's not a single "per layer" result in the list of best trials. Still a small dataset, but pretty striking.

Edit: I'm also realizing that I've significantly misunderstood how the Optuna study actually works, and I need to reset part of my config to get a fair comparison to the original abliteration method.

@p-e-w
Owner

p-e-w commented Nov 28, 2025

Interestingly, this was completely wrong. In fact, it's the "global" results that dominate - after running 200 trials, there's not a single "per layer" result in the list of best trials. Still a small dataset, but pretty striking.

This is a complicated topic that I'm currently investigating in detail.

The main problem is that some layers don't meaningfully separate residuals for good and bad prompts, so they cannot be self-abliterated. But then some layers in between such layers are important for model behavior, and absolutely do need to be abliterated. But since Heretic's weight kernels are unimodal, we cannot abliterate those layers without also abliterating the indeterminate layers that surround them, at least on one side. Thus per-layer abliteration gives poor results, and a global direction needs to be selected.
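
To illustrate what "unimodal" means here: the per-layer ablation strength is a single bump over the layer index, roughly like the sketch below (the kernel shape and parameter semantics are simplified illustrations, not Heretic's exact code), so high weight cannot be assigned to two separated layer ranges without also covering everything in between.

```python
import torch


def unimodal_layer_weights(num_layers: int, max_weight: float, max_weight_position: float,
                           width: float = 4.0) -> torch.Tensor:
    """Illustrative unimodal kernel: peaks at max_weight_position and decays on both sides.

    max_weight_position is treated here as a fraction of the layer range; the real
    kernel shape and parameterization in Heretic may differ.
    """
    layer_idx = torch.arange(num_layers, dtype=torch.float32)
    center = max_weight_position * (num_layers - 1)
    return max_weight * torch.exp(-((layer_idx - center) / width) ** 2)
```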

@mateusz-malicki

some layers don't meaningfully separate residuals for good and bad prompts, so they cannot be self-abliterated. But then some layers in between such layers are important for model behavior, and absolutely do need to be abliterated

Hmmm... I was experimenting with modeling the ablation weight as polynomials. While it does not solve the main issue, it can help with such cases. The huge downside of my approach is the number of parameters we must add - 200 trials isn't even close to "enough" anymore (modeling constraints with MOTPE sucks).

@spikymoth
Contributor Author

spikymoth commented Nov 28, 2025

I've found that with the settings from this PR and trials limited to the "global" direction scope, a lot of the better results in terms of lowering the number of refusals also have a very high trial index. Currently doing a run with 500 trials and some Optuna visualization functions at the end to see how many trials are required for convergence, but it'll take a while to finish.

@spikymoth
Contributor Author

spikymoth commented Nov 28, 2025

Added some commits to make the direction scope configurable and add optional winsorization (magnitude clipping) for the residual tensors, applied before taking the mean (set it to 99 to match the 0.995 from the MPOA articles). I must confess that I haven't fully tested the latter, though it seems straightforward enough.

I made the direction scope into an enum, which might be overkill but felt better than typing the same strings everywhere.
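
For reference, it boils down to something like this (a minimal sketch; the actual names in the branch may differ):

```python
from enum import Enum


class DirectionScope(str, Enum):
    """Which refusal-direction strategy a trial is allowed to use (illustrative names)."""

    GLOBAL = "global"        # single direction from a random point between two layers
    PER_LAYER = "per layer"  # a separate refusal direction for every layer
    BOTH = "both"            # let the optimizer pick either strategy per trial
```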

@spikymoth
Contributor Author

I'm noticing that the optimization is wasting a lot of trials minimizing the KL divergence to tiny values like 0.002. I guess it must be tempting to drive it to 0 to minimize the scalarized objective, but values that low are pretty meaningless.

I'll try adding an offset of 0.01 to the KL divergence in the score to see if that encourages more aggressive solutions.

@mateusz-malicki

How? Want to penalize "too low" kl-div? A better idea is to penalize a too-high refusal count in the KL score, so those regions will be less attractive from the sampler's perspective - but it is still not the best idea in my opinion. Weighting kl-div vs refusals is hard; that's why you get a Pareto front from MOTPE instead of the best result from TPE.

@spikymoth
Contributor Author

How? Want to penalize "too low" kl-div?

I'm not penalizing it per se, just making very low values less interesting to iterate on by pushing the minimum reachable score away from 0. Look at the Pareto front for the last study:
[Image: Pareto front plot for the last study]

You can see that it's really enamored with the idea of pushing KL divergence to 0, practically ignoring pushing for a lower number of refusals in comparison. Maybe my fix is too crude and will show the same clustering; checking it now.

@mateusz-malicki

I mean, you have to choose an arbitrary number for your "realistically low" kl-div, and it's hard to do beforehand. Compare abliterating Qwen vs gpt-oss. I do agree that sampling in this case is far from perfect, but I'm not sure that moving the 0-point is the right solution.

@spikymoth
Contributor Author

Making it configurable is easy enough. It can be kl_divergence_bias to complement kl_divergence_scale. If a low value like 0.01 makes a big difference then I think it's a reasonable default, but we can always default it to 0 as well.

Currently still testing with 0.01, the individual results look less clustered around a KL divergence of 0 but I'll wait to see the Pareto front graph once it finishes before drawing any conclusions. I'll test some other values too if it's promising, should take a few days.

@spikymoth spikymoth force-pushed the implement-mpoa branch 2 times, most recently from 3ebfdf3 to 408ab65 on November 29, 2025 22:58
@spikymoth
Contributor Author

The idea of a bias didn't work - it seemed to learn to ignore it. I'm going to try a power-law transformation.

I'm also noticing that it really struggles to learn the relationship between direction_index and max_weight_position. When using a global direction_index, I think these should usually be closely aligned to maximize strength near the layer that the direction was taken from (rather than at some random other point). I might open a separate PR for that and some other refinements to the objective function, but I'll need to test them first.

@p-e-w
Owner

p-e-w commented Dec 1, 2025

I'll try adding an offset of 0.01 to the KL divergence in the score to see if that encourages more aggressive solutions.

It won't. The TPE optimizer just tries to lower the score. It doesn't know or care what the minimum is. Shifting by a constant offset makes no difference at all, the GMMs will simply have shifted means.

You can see that it's really enamored with the idea of pushing KL divergence to 0, practically ignoring pushing for a lower number of refusals in comparison.

That's what the kl_divergence_scale configuration parameter is supposed to control. If you set it to a large value, the KLD part of the score will be less sensitive to changes, and since the optimization is not invariant to relative scaling of score dimensions, this should encourage the optimization to prioritize reducing refusals.

That being said, there is an inherent bias towards a KL divergence of 0, because that's what you get by simply doing nothing. In a way, reducing refusals is much harder than reducing the KL divergence.

@mateusz-malicki

To be fair, you can achieve what you want by using some offset > 0 AND taking ABS(kl-div). But again - you will penalize kl-div below some arbitrary X, and I don't think that will be effective.

@p-e-w
Owner

p-e-w commented Dec 2, 2025

@mateusz-malicki

I did originally start out with a compound score, though with a nonlinear element intended to weaken the effect of low KL divergences (see 8a1acef for the commit that removed it). But this consistently gave worse results because it's hard for the optimizer to understand tradeoffs if all you have is a single score for steering.

@mateusz-malicki

@p-e-w Oh, I totally agree that multi-objective is the better idea :)

Weighting kl-div vs refusals is hard; that's why you get a Pareto front from MOTPE instead of the best result from TPE.

For exactly the same reason as here: we need an arbitrary number (the offset for kl-div in spikymoth's solution, or the kl-div weight in the case of [SO]TPE). And my point is: there is no universally good number that will fit any model - it's model-specific; so you have to discover it; so we can see it as an additional objective; and we already have MOTPE in place, so we can skip discovering the perfect magic number and just let exploration of the Pareto front do its job ;) And there is an additional point against penalization: there is really no guarantee that our search space is that smooth.

@spikymoth
Contributor Author

Small update since it's been a few days: I'm running some somewhat orthogonal tests on parameter selection that I'm hoping will also make this more effective at finding strong solutions. I'll turn that into a separate PR soon.

Once that's done, I'll see about rebasing this on top of the recent changes.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

@spikymoth

Sounds interesting. That kind of exploratory work is very much appreciated!

@p-e-w
Owner

p-e-w commented Dec 3, 2025

Here's a quick idea for a score that might work better than the current one:

Let KLD be the KL divergence, and R be the refusal rate (0 <= R <= 1). Let A be some configurable threshold.

Then the score is a tuple (a, b) with the following definition:

  • a = R
  • b = KLD if KLD >= A, and R * A if KLD < A

A is the "acceptable KLD threshold" below which you only care about minimizing refusals, and that's exactly what the score captures. If KLD < A, both score dimensions represent refusals, and the optimizer only optimizes for refusals. The expression R * A ensures that optimizing for refusals always wins for KLD < A, because R <= 1 by definition. Above A we get multi-objective optimization like before.

@p-e-w
Owner

p-e-w commented Jan 23, 2026

The article introducing Norm-Preserving Biprojected Abliteration mentions specific (very high) scores for the abliterated version of Gemma-3 12B Instruct on the UGI and NatInt benchmarks.

It would be great if you could try to reproduce these scores, so we can quantify how close this implementation gets to the original technique.

This could also help to decide whether we should enable some of these new options by default.

@spikymoth
Contributor Author

Ah, seems I'd need to publish an abliterated model and then request for it to be benchmarked by opening an HF discussion, as the questions are private to avoid gaming the system. Understandable, but also annoying compared to just running a test locally.

@anrp
Contributor

anrp commented Jan 25, 2026

FYI I did some more testing with the tip of this branch on Qwen3-VL-30B-A3B-Thinking, and it turns out that the max_weight change was critical for getting refusals down. I'll collect some more data, but a quick vibe check shows that without this PR but with higher max_weight, the output is not good (hitting max token limit responding to initial "hi") but with this PR the output remains more stable.

@spikymoth
Contributor Author

Could you try with high=2.0 for max_weight? In my tests with the latest version, that seems to be pretty effective (though I'm often letting it run much longer than the default 200 iterations, so that probably contributes as well).

@p-e-w
Owner

p-e-w commented Jan 25, 2026

Ah, seems I'd need to publish an abliterated model and then request for it to be benchmarked by opening an HF discussion

Ah, I didn't know that. On the bright side, you don't have to run the benchmarks yourself.

Perhaps you can publish multiple versions with different parameter combinations to see how they affect the outcome.

@anrp
Contributor

anrp commented Jan 25, 2026

Results, all for Qwen3-VL-30B-A3B-Thinking, top-4 Pareto, "This PR" = "--row-normalization full --orthogonalize-direction":
Tip of master, max_weight=1.5

? Which trial do you want to use? (Use arrow keys)
 » [Trial  66] Refusals: 61/100, KL divergence: 0.0198
   [Trial 200] Refusals: 65/100, KL divergence: 0.0068
   [Trial 166] Refusals: 76/100, KL divergence: 0.0039
   [Trial 144] Refusals: 77/100, KL divergence: 0.0030

This PR, max_weight=1.5

? Which trial do you want to use? (Use arrow keys)
 » [Trial 186] Refusals: 24/100, KL divergence: 0.0089
   [Trial 193] Refusals: 26/100, KL divergence: 0.0078
   [Trial 188] Refusals: 27/100, KL divergence: 0.0071
   [Trial 192] Refusals: 28/100, KL divergence: 0.0055

Tip of master, max_weight=2.0

? Which trial do you want to use? (Use arrow keys)
 » [Trial  79] Refusals: 56/100, KL divergence: 0.0080
   [Trial 186] Refusals: 66/100, KL divergence: 0.0079
   [Trial 127] Refusals: 69/100, KL divergence: 0.0075
   [Trial 136] Refusals: 70/100, KL divergence: 0.0041

This PR, max_weight=2.0

? Which trial do you want to use? (Use arrow keys)
 » [Trial  74] Refusals: 16/100, KL divergence: 0.0077
   [Trial  73] Refusals: 19/100, KL divergence: 0.0065
   [Trial  75] Refusals: 21/100, KL divergence: 0.0045
   [Trial  55] Refusals: 47/100, KL divergence: 0.0033

Tip of master, max_weight=15 & log=True (vibe: 63,114,90 couldn't even respond to "hi" properly)

? Which trial do you want to use? (Use arrow keys)
 » [Trial  63] Refusals:  0/100, KL divergence: 0.5594
   [Trial 114] Refusals:  1/100, KL divergence: 0.1506
   [Trial 148] Refusals:  3/100, KL divergence: 0.0758
   [Trial  90] Refusals:  4/100, KL divergence: 0.0519

This PR, max_weight=15 & log=True (vibe: all responded to "hi" properly)

? Which trial do you want to use? (Use arrow keys)
 » [Trial 189] Refusals:  0/100, KL divergence: 0.8448
   [Trial  29] Refusals:  1/100, KL divergence: 0.0755
   [Trial 108] Refusals:  3/100, KL divergence: 0.0139
   [Trial  72] Refusals:  4/100, KL divergence: 0.0127

Attached the checkpoint files because that's now a thing 🎉 The optimizer is pushing against max_weight in both the 1.5 and 2.0 cases.
checkpoints.tar.gz

@spikymoth
Contributor Author

spikymoth commented Jan 25, 2026

Cool! Was that with winsorization_quantile=1.0? winsorization_quantile=0.995 seems to help a lot with Gemma-3, though other architectures might not benefit.

Edit: Oh yeah, that's in the checkpoint file too - so it was with winsorization_quantile=1.0. 0.995 might be worth a shot!

@anrp
Contributor

anrp commented Jan 25, 2026

This PR, max_weight=2.0, winsorization_quantile=0.995

? Which trial do you want to use? (Use arrow keys)
 » [Trial 174] Refusals:  3/100, KL divergence: 0.0078
   [Trial 178] Refusals:  4/100, KL divergence: 0.0074
   [Trial 183] Refusals:  5/100, KL divergence: 0.0070
   [Trial 181] Refusals:  6/100, KL divergence: 0.0068

checkpoints2.tar.gz

@spikymoth
Contributor Author

Interesting! So winsorization is still highly relevant. Somewhat orthogonal to this PR, I've been trying to get an adaptive robust estimator working which might be a more principled alternative to winsorization. Unfortunately it's also quite complex in comparison, so it might be hard to justify. But the effectiveness of winsorization does seem to show that outliers can have a big impact on performance.

@p-e-w
Owner

p-e-w commented Jan 26, 2026

I've been trying to get an adaptive robust estimator working which might be a more principled alternative to winsorization.

This was actually one of the reasons why I added the various metrics in the --print-residual-geometry output; I wanted to see if a better alternative to the difference-of-means exists.

The geometric median seemed like an obvious choice as it is much more robust to outliers. I did experiment with it a few times and the results were underwhelming, but perhaps it might perform better when combined with biprojection. The silhouette coefficient is certainly an interesting quantity, and appears to strongly correlate with the apparent cluster separation from PaCMAP. Excluding layers with low silhouette coefficients from modification seems like a promising idea to try.

As always, keep in mind that given how mind-bogglingly huge the problem space is, we cannot rely on empirical results alone. Whatever approach we choose must have some grounding in a theory of what is actually going on. It's too easy to overfit to a specific situation otherwise.
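
If you want to try the layer-exclusion idea, the filtering itself is simple; a rough sketch (function and variable names are made up, not Heretic's API, and the cutoff is arbitrary):

```python
import torch
from sklearn.metrics import silhouette_score


def layers_worth_modifying(good_residuals: list[torch.Tensor],  # per-layer [n_good, hidden]
                           bad_residuals: list[torch.Tensor],   # per-layer [n_bad, hidden]
                           min_silhouette: float = 0.1) -> list[int]:
    """Keep only layers whose good/bad residuals form reasonably separated clusters."""
    selected = []
    for layer, (good, bad) in enumerate(zip(good_residuals, bad_residuals)):
        features = torch.cat([good, bad]).float().cpu().numpy()
        labels = [0] * len(good) + [1] * len(bad)
        if silhouette_score(features, labels) >= min_silhouette:
            selected.append(layer)
    return selected
```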

@spikymoth
Contributor Author

Results with my robust estimator are also underwhelming; it doesn't seem to be a replacement for winsorization.

That seems to imply that there's something else going on; perhaps rather than handling outliers, winsorization damps the 1st-order response of the model to the class of prompts (e.g. what it perceives the prompt to be about), allowing refusal to dominate as a 2nd-order effect. I want to do some experiments with SVD to see if removing (or damping) the primary modes of variation has the same effect - but those certainly won't make it into this PR.
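
For the record, the SVD experiment I have in mind looks roughly like this (pure sketch, nothing from the actual codebase; mode count and damping factor are arbitrary):

```python
import torch


def damp_primary_modes(residuals: torch.Tensor, k: int = 1, factor: float = 0.0) -> torch.Tensor:
    """Damp the top-k modes of variation in a [n_prompts, hidden] residual matrix.

    factor=0.0 removes the modes entirely; values in (0, 1) merely attenuate them.
    """
    mean = residuals.mean(dim=0, keepdim=True)
    u, s, vh = torch.linalg.svd(residuals - mean, full_matrices=False)
    s = s.clone()
    s[:k] *= factor
    return mean + u @ torch.diag(s) @ vh
```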

Adds setting winsorization_quantile, expressed as the quantile to clamp to:
- If set to a value below 1, the residuals obtained from evaluating the first token of the good and bad prompts are winsorized - that is, values outside the given quantile are clamped. Note that winsorization_quantile = 0.95 corresponds to a 90% winsorization.
Adds boolean setting orthogonalize_direction:
- When enabled, only the component of the refusal directions that is orthogonal to the harmless direction is subtracted during abliteration.

Adds enum-valued setting row_normalization:
- 'none': No normalization.
- 'pre': Row-normalize the weight matrix before computing the LoRA adapter.
- 'full': Like 'pre', but re-normalizes to preserve original row magnitudes.
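
Roughly, the three options correspond to the following operations (an illustrative sketch with made-up names; the real implementation packages the weight edit as a LoRA adapter and may differ in details):

```python
import torch


def winsorize(residuals: torch.Tensor, quantile: float) -> torch.Tensor:
    """Clamp each residual dimension to its [1 - q, q] quantile range (per-dimension here for illustration)."""
    if quantile >= 1.0:
        return residuals
    hi = torch.quantile(residuals, quantile, dim=0, keepdim=True)
    lo = torch.quantile(residuals, 1.0 - quantile, dim=0, keepdim=True)
    return residuals.clamp(min=lo, max=hi)


def orthogonalize(refusal_dir: torch.Tensor, harmless_dir: torch.Tensor) -> torch.Tensor:
    """Keep only the component of the refusal direction orthogonal to the harmless direction."""
    harmless_hat = harmless_dir / harmless_dir.norm()
    orthogonal = refusal_dir - (refusal_dir @ harmless_hat) * harmless_hat
    return orthogonal / orthogonal.norm()


def abliterate_rows(weight: torch.Tensor, direction: torch.Tensor, scale: float,
                    row_normalization: str = "none") -> torch.Tensor:
    """Subtract the (scaled) direction component from each weight row.

    'pre' operates on row-normalized weights; 'full' additionally restores the
    original row magnitudes afterwards, which is the magnitude-preserving part.
    How 'pre' is folded back into the LoRA delta is glossed over here.
    """
    original_norms = weight.norm(dim=1, keepdim=True)
    w = weight / original_norms if row_normalization in ("pre", "full") else weight
    d = direction / direction.norm()
    w = w - scale * (w @ d).unsqueeze(1) * d
    if row_normalization == "full":
        w = w / w.norm(dim=1, keepdim=True) * original_norms
    return w
```
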
@spikymoth
Contributor Author

No changes since the last version, just rebased to pick up the recent changes to main (e.g. checkpointing).

Owner

@p-e-w p-e-w left a comment


Full review. Looks excellent overall.

@spikymoth
Contributor Author

Alright, I think that addresses everything :)

# Apply LoRA adapters to the CPU model

print("* Applying LoRA adapters...")
target_modules = self.get_abliterable_components()
Owner


You actually fixed a bug here by accident, because the component name splitting was not done here before, unlike during initial loading! 😄

Contributor Author


Oops, haha, I definitely didn't spot that but good to hear!

@p-e-w p-e-w merged commit 3525b1a into p-e-w:master Feb 2, 2026
4 checks passed
@p-e-w
Owner

p-e-w commented Feb 2, 2026

Merged! This is a milestone for the project, and has been the most-requested feature on Reddit. Thank you for this excellent PR, and for seeing it through to the end.

I plan to ship this as-is in the upcoming Heretic 1.2, and possibly enable it by default in the version after that.

@spikymoth
Contributor Author

Thank you for all the patience with the long draft stage! Feels good to get this in :)
