Implement Magnitude-Preserving Orthogonal Ablation #52
Conversation
|
As you already hinted at, this isn't really compatible with Heretic's KL divergence score. The biprojection technique is supposed to make models smarter by changing model outputs even for "harmless" prompts (increasing the KL divergence). Thus the parameters that Heretic's optimizer converges to wouldn't be the ones that actually result in the claimed extra benefits from biprojection. If KL divergence is the correct metric, it doesn't really matter how you abliterate unless one abliteration technique results in lower KL divergences for a given refusal count. |
|
Yeah, I guess that while the refusal count is still an objective metric that makes sense (and is crucial for deciding which trial results to keep), the focus on KL divergence in determining the Pareto front of optimal solutions risks discarding good solutions that don't align with the metric (and currently there's no other way to distinguish them aside from chatting with the model). I suppose it still acts as a somewhat decent sanity check for whether a change specifically targets things that are considered harmful, or completely breaks the model's ability to respond. But it won't be able to measure whether responses to safe or borderline requests become better or worse. |
|
Trying it myself, so far I've noticed that the results seem more constrained with this method, with no results that show both a high KL divergence and a near-zero number of refusals. In terms of KL divergence, the "bang for buck" seems about the same or slightly worse. The constrained nature of the results does mean that at least 2 passes seem necessary to approach zero refusals. One thing I'm worried about is that the "per layer" direction scope might be outperforming the "global" direction scope in terms of KL divergence, whereas the global scope is closer to the MPOA algorithm. It might be a good idea to make the direction scope configurable (with values "global", "per layer" and "both", defaulting to "both"), and if both scopes are allowed, display the type in the results. I pushed a commit to do this to my development branch: https://github.com/spikymoth/heretic/tree/development |
Interestingly, this was completely wrong. In fact, it's the "global" results that dominate - after running 200 trials, there's not a single "per layer" result in the list of best trials. Still a small dataset, but pretty striking. Edit: I'm also realizing that I've significantly misunderstood how the optuna study actually works, and I need to reset part of my config to get a fair comparison to the original abliteration method. |
This is a complicated topic that I'm currently investigating in detail. The main problem is that some layers don't meaningfully separate residuals for good and bad prompts, so they cannot be self-abliterated. But then some layers in between such layers are important for model behavior, and absolutely do need to be abliterated. But since Heretic's weight kernels are unimodal, we cannot abliterate those layers without also abliterating the indeterminate layers that surround them, at least on one side. Thus per-layer abliteration gives poor results, and a global direction needs to be selected. |
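For clarity, here's roughly what the two direction scopes mean - a simplified sketch with made-up tensor names, not Heretic's actual code:

```python
import torch

# Hypothetical inputs: harmful[l] and harmless[l] are (num_prompts, hidden_size)
# residual tensors cached at layer l for the "bad" and "good" prompt sets.
def refusal_directions(harmful, harmless, scope="global"):
    # Difference of means per layer, pointing from harmless towards harmful.
    per_layer = [b.mean(dim=0) - g.mean(dim=0) for b, g in zip(harmful, harmless)]
    if scope == "per_layer":
        # Each layer is ablated along its own (unit-normalized) direction.
        return [d / d.norm() for d in per_layer]
    # Global scope: one shared direction, e.g. the normalized mean of the
    # per-layer differences, applied to every layer.
    d = torch.stack(per_layer).mean(dim=0)
    return [d / d.norm()] * len(per_layer)
```

With the global scope, layers that don't separate the two prompt classes simply inherit a direction estimated from the layers that do.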
Hmmm... I was experimenting with modeling the ablation weight as polynomials. While it does not solve the main issue, it can help with such cases. The huge downside of my approach is the number of parameters we must add - 200 trials isn't even close to "enough" anymore (modeling constraints with motpe sucks). |
|
I've found that with the settings from this PR and trials limited to the "global" direction scope, many of the better results in terms of lowering the number of refusals also had very high trial indices. I'm currently doing a run with 500 trials and some Optuna visualization functions at the end to see how many trials are required for convergence, but it'll take a while to finish. |
|
Added some commits to make the direction scope configurable and add optional winsorization (magnitude clipping) for the residual tensors, applied before taking the mean (set it to 99 to match the 0.995 from the MPOA articles). I must confess that I haven't fully tested the latter, though it seems straightforward enough. I made the direction scope into an enum, which might be overkill but felt better than typing the same strings everywhere. |
|
I'm noticing that the optimization is wasting a lot of trials minimizing the KL divergence to tiny values like 0.002. I guess it must be tempting to drive it to 0 to minimize the scalarized objective, but values that low are pretty meaningless. I'll try adding an offset of 0.01 to the KL divergence in the score to see if that encourages more aggressive solutions. |
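Concretely, the tweak amounts to something like this (a sketch with made-up names, not the actual scoring code):

```python
def objectives(kl_divergence: float, refusals: int,
               kl_offset: float = 0.01) -> tuple[float, int]:
    # Idea under discussion: shift the KL term by a small constant so that
    # polishing it from 0.01 down to 0.002 no longer looks like meaningful
    # progress to the optimizer.
    return (kl_divergence + kl_offset, refusals)
```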
|
How? Do you want to penalize "too low" kl-div? A better idea is to penalize a too-high refusal count on the KL score, so those regions become less attractive from the sampler's perspective - but it is still not the best idea in my opinion. Weighting kl-div vs. refusals is hard; that's why you get a Pareto front from motpe instead of a single best result from tpe. |
Force-pushed from 648d617 to e6d4004
|
I mean, you have to choose an arbitrary number for your "realistically low" kl-div, and it's hard to do beforehand. Compare abliterating Qwen vs. gpt-oss. I do agree that sampling in this case is far from perfect, but I'm not sure if moving the 0-point is the right solution. |
|
Making it configurable is easy enough. I'm currently still testing with 0.01; the individual results look less clustered around a KL divergence of 0, but I'll wait to see the Pareto front graph once it finishes before drawing any conclusions. I'll test some other values too if it's promising; should take a few days. |
Force-pushed from 3ebfdf3 to 408ab65
|
The idea of a bias didn't work - it seemed to learn to ignore it. I'm going to try a power-law transformation. I'm also noticing that it really struggles to learn the relationship between |
It won't. The TPE optimizer just tries to lower the score; it doesn't know or care what the minimum is. Shifting by a constant offset makes no difference at all: the GMMs will simply have shifted means.
That being said, there is an inherent bias towards a KL divergence of 0, because that's what you get by simply doing nothing. In a way, reducing refusals is much harder than reducing the KL divergence. |
|
To be fair, you can achieve what you want by using some offset > 0 AND using abs(kl-div). But again - you will penalize kl-div below some arbitrary X, and I don't think that will be effective. |
|
I did originally start out with a compound score, though with a nonlinear element intended to weaken the effect of low KL divergences (see 8a1acef for the commit that removed it). But this consistently gave worse results because it's hard for the optimizer to understand tradeoffs if all you have is a single score for steering. |
|
@p-e-w Oh, I totally agree on the fact that multi-objective is the better idea :)
For exactly the same reason as here: we need an arbitrary number (the offset for kl-div in spikymoth's solution, or the kl-div weight in the case of [SO]TPE). And my point is: there is no universally good number that will fit any model - it's model-specific, so you have to discover it, so we can see it as an additional objective. And we already have motpe in place, so we can skip discovering the perfect magic number and just let exploration of the Pareto front do its job ;) And there is an additional point against penalization: there is really no guarantee that our search space is that smooth. |
|
Small update since it's been a few days: I'm running some somewhat orthogonal tests on parameter selection that I'm hoping will also make this more effective at finding strong solutions. I'll turn that into a separate PR soon. Once that's done, I'll see about rebasing this on top of the recent changes. |
|
Sounds interesting. That kind of exploratory work is very much appreciated! |
|
Here's a quick idea for a score that might work better than the current one: Let Then the score is a tuple
|
Force-pushed from 8380471 to 3a2f247
Force-pushed from 3a2f247 to 63e4631
|
The article introducing Norm-Preserving Biprojected Abliteration mentions specific (very high) scores for the abliterated version of Gemma-3 12B Instruct on the UGI and NatInt benchmarks. It would be great if you could try to reproduce these scores, so we can quantify how close this implementation gets to the original technique. This could also help to decide whether we should enable some of these new options by default. |
|
Ah, seems I'd need to publish an abliterated model and then request for it to be benchmarked by opening an HF discussion, as the questions are private to avoid gaming the system. Understandable, but also annoying compared to just running a test locally. |
|
FYI I did some more testing with the tip of this branch on Qwen3-VL-30B-A3B-Thinking, and it turns out that the max_weight change was critical for getting refusals down. I'll collect some more data, but a quick vibe check shows that without this PR but with a higher max_weight, the output is not good (hitting the max token limit while responding to an initial "hi"), whereas with this PR the output remains more stable. |
|
Could you try with |
Ah, I didn't know that. On the bright side, you don't have to run the benchmarks yourself. Perhaps you can publish multiple versions with different parameter combinations to see how they affect the outcome. |
|
Results, all for Qwen3-VL-30B-A3B-Thinking, top-4 Pareto, "This PR" = `--row-normalization full --orthogonalize-direction`:

- This PR, max_weight=1.5
- Tip of master, max_weight=2.0
- This PR, max_weight=2.0
- Tip of master, max_weight=15 & log=True (vibe: 63,114,90 couldn't even respond to "hi" properly)
- This PR, max_weight=15 & log=True (vibe: all responded to "hi" properly)

Attached the checkpoint files because that's now a thing 🎉 The optimizer is pushing against max_weight in both the 1.5 and 2.0 cases. |
|
Cool! Was that with

Edit: Oh yeah, that's in the checkpoint file too - so it was with |
|
This PR, max_weight=2.0, winsorization_quantile=0.995 |
|
Interesting! So winsorization is still highly relevant. Somewhat orthogonal to this PR, I've been trying to get an adaptive robust estimator working, which might be a more principled alternative to winsorization. Unfortunately it's also quite complex in comparison, so it might be hard to justify. But the effectiveness of winsorization does seem to show that outliers can have a big impact on performance. |
This was actually one of the reasons why I added the various metrics in the

The geometric median seemed like an obvious choice, as it is much more robust to outliers. I did experiment with it a few times and the results were underwhelming, but perhaps it might perform better when combined with biprojection.

The silhouette coefficient is certainly an interesting quantity, and appears to strongly correlate with the apparent cluster separation from PaCMAP. Excluding layers with low silhouette coefficients from modification seems like a promising idea to try.

As always, keep in mind that given how mind-bogglingly huge the problem space is, we cannot rely on empirical results alone. Whatever approach we choose must have some grounding in a theory of what is actually going on. It's too easy to overfit to a specific situation otherwise. |
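For reference, the geometric median has no closed form, but a standard way to approximate it is Weiszfeld iteration - a generic sketch, not the code used in those experiments:

```python
import torch

def geometric_median(points: torch.Tensor, iters: int = 100,
                     eps: float = 1e-8) -> torch.Tensor:
    # points: (n, d). Start from the arithmetic mean, then repeat the
    # Weiszfeld update: a weighted mean with weights 1/distance, which
    # progressively downweights outliers far from the current estimate.
    median = points.mean(dim=0)
    for _ in range(iters):
        dist = (points - median).norm(dim=1).clamp_min(eps)
        weights = 1.0 / dist
        median = (weights[:, None] * points).sum(dim=0) / weights.sum()
    return median
```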
|
Results with my robust estimator are also underwhelming; it doesn't seem to be a replacement for winsorization. That seems to imply that there's something else going on; perhaps rather than handling outliers, winsorization damps the first-order response of the model to the class of prompts (e.g. what it perceives the prompt to be about), allowing refusal to dominate as a second-order effect. I want to do some experiments with SVD to see if removing (or damping) the primary modes of variation has the same effect - but those certainly won't make it into this PR. |
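Roughly what I have in mind - a sketch only; `k` and the damping factor are arbitrary placeholders:

```python
import torch

def damp_principal_modes(residuals: torch.Tensor, k: int = 2,
                         factor: float = 0.1) -> torch.Tensor:
    # residuals: (num_prompts, hidden_size). Instead of clamping individual
    # outlier values, damp the top-k singular modes - the primary directions
    # of variation across prompts - and reconstruct.
    mean = residuals.mean(dim=0)
    u, s, vh = torch.linalg.svd(residuals - mean, full_matrices=False)
    s = s.clone()
    s[:k] *= factor
    return (u * s) @ vh + mean
```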
Adds setting `winsorization_quantile`, expressed as the quantile to clamp to:
- If set to a value below 1, the residuals obtained from evaluating the first token of the good and bad prompts are winsorized - that is, values outside the given quantile are clamped. Note that `winsorization_quantile = 0.95` corresponds to a 90% winsorization.
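A sketch of the clamping semantics (assuming symmetric two-tailed clamping; not the exact implementation):

```python
import torch

def winsorize(residuals: torch.Tensor, quantile: float = 0.95) -> torch.Tensor:
    # Clamp values outside the symmetric quantile band [1 - q, q].
    # With quantile=0.95, both tails are clamped at the 5%/95% marks,
    # i.e. a "90% winsorization" in the usual terminology.
    lo = torch.quantile(residuals, 1.0 - quantile)
    hi = torch.quantile(residuals, quantile)
    return residuals.clamp(lo.item(), hi.item())
```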
Adds boolean setting `orthogonalize_direction`:
- When enabled, only the component of the refusal directions that is orthogonal to the harmless direction is subtracted during abliteration.

Adds enum-valued setting `row_normalization`:
- `none`: No normalization.
- `pre`: Row-normalize the weight matrix before computing the LoRA adapter.
- `full`: Like `pre`, but re-normalizes afterwards to preserve the original row magnitudes.
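To illustrate the two operations - a heavily simplified sketch with a single direction per matrix and none of the LoRA details; not the actual implementation:

```python
import torch

def orthogonalize(refusal_dir: torch.Tensor,
                  harmless_dir: torch.Tensor) -> torch.Tensor:
    # orthogonalize_direction: keep only the component of the refusal
    # direction that is orthogonal to the mean harmless direction.
    h = harmless_dir / harmless_dir.norm()
    return refusal_dir - (refusal_dir @ h) * h

def ablate_magnitude_preserving(weight: torch.Tensor,
                                direction: torch.Tensor) -> torch.Tensor:
    # row_normalization="full": project the direction out of every row,
    # then rescale each row back to its original magnitude, so ablation
    # changes only the orientation of the weights, not their norms.
    d = direction / direction.norm()
    norms = weight.norm(dim=1, keepdim=True)
    ablated = weight - (weight @ d)[:, None] * d[None, :]
    return ablated / ablated.norm(dim=1, keepdim=True).clamp_min(1e-8) * norms
```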
Force-pushed from dc875dd to 36e49d5
|
No changes since the last version, just rebased to pick up the recent changes to main (e.g. checkpointing). |
p-e-w left a comment:
Full review. Looks excellent overall.
|
Alright, I think that addresses everything :) |
# Apply LoRA adapters to the CPU model
print("* Applying LoRA adapters...")
target_modules = self.get_abliterable_components()
You actually fixed a bug here by accident, because the component name splitting was not done here before, unlike during initial loading! 😄
Oops, haha, I definitely didn't spot that but good to hear!
|
Merged! This is a milestone for the project, and has been the most-requested feature on Reddit. Thank you for this excellent PR, and for seeing it through to the end. I plan to ship this as-is in the upcoming Heretic 1.2, and possibly enable it by default in the version after that. |
|
Thank you for all the patience with the long draft stage! Feels good to get this in :) |

That is, https://huggingface.co/blog/grimjim/projected-abliteration and https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration with the updated name from https://huggingface.co/posts/grimjim/803126534676334 :)
Marking this as a draft mostly because I don't know the timeline for #43 - either this will need to be rebased on that, or that on this. But the code changes here are relatively small, so I think either direction should be simple enough.
A few notes:
Additional caveats: