add support for hybrid layer models: conv, mamba #43
pszemraj wants to merge 6 commits into p-e-w:master from
Conversation
Signed-off-by: Peter Szemraj <74869040+pszemraj@users.noreply.github.com>
- Extract mlp.shared_down_proj, block_sparse_moe.output_linear across variants
- Discover components dynamically per-layer for distinct Optuna parameter tracking
- Separate abliteration stats: "no matrices" vs. "outside configured ranges"
- Apply dedicated parameter ranges to shared MLP components

Signed-off-by: Peter Szemraj <74869040+pszemraj@users.noreply.github.com>

- Extract LFM-specific projections: self_attn.out_proj, layer.feed_forward.w2/output_linear
- Cover LFM conv layers and MoE expert paths
- Maintain conv/mamba/Granite handling

Signed-off-by: Peter Szemraj <74869040+pszemraj@users.noreply.github.com>

…trials

Signed-off-by: Peter Szemraj <74869040+pszemraj@users.noreply.github.com>

- Auto-enable do_sample when sampling fields (temperature/top_k/top_p/typical_p) present
- Validate config to prevent HF validation errors on upload

Signed-off-by: Peter Szemraj <74869040+pszemraj@users.noreply.github.com>
Some quick points/follow-ups @p-e-w:
p-e-w left a comment
I'm super excited about this PR! There have been very few successful abliterations of non-traditional architectures so far, and it's amazing that Heretic will have this ability soon!
My comments are just an initial review based on a first look through the code. My most important request is that you split this PR into at least 3 PRs:
- One with the trial filtering
- One with the generation config sanitization
- One with the changes expanding model support
This is a complex change, and I would like to discuss it in semantically distinct units.
As for your questions:
The package version should probably be bumped to 1.0.2 or so after this. Do you want me to update the code, or do you have a process/want to edit it yourself?
I intend to publish version 1.1.0 next week or so, incorporating all the improvements so far, hopefully including this one :) I will take care of the version bump after wrapping everything up.
You can get slightly better results for LFM2 models with even more dynamic/updated value ranges. Check out that gist, but I decided not to include it in this PR as it makes things more complex and can impact search-space time/quality for standard models.
In principle, it makes sense to have component-dependent ranges for all parameter types if you have them for some of them. However, in my experience it's very difficult to determine what ranges are good to begin with. I've done well over a hundred full optimization runs and I'm still not sure about some of them.
I have a list of ideas/nice-to-haves that I thought of when implementing this PR + standalone usage; want me to create an issue, or how best to share them?
I'm excited to hear those ideas! Please file issues to discuss them.
One important consideration is that as the number of components (and thus parameter sets) exceeds 2, the number of trials needed to converge to a good result also increases. The folklore rule of thumb for TPE is that you should have at least 10-20 trials per dimension, which matches my experience for this optimization problem. While I don't think that n_trials should be increased automatically, it's probably a good idea to warn the user if n_components > 2 and n_trials <= 200.
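For illustration only, a minimal sketch of what that warning could look like (the function and argument names are hypothetical, not the actual Heretic API):

def maybe_warn_about_trial_count(n_components: int, n_trials: int) -> None:
    # TPE folklore: ~10-20 trials per search dimension, and each extra
    # component adds parameters, so >2 components with <=200 trials is risky.
    if n_components > 2 and n_trials <= 200:
        print(
            f"Warning: {n_components} components but only {n_trials} trials; "
            "consider increasing the trial count for reliable convergence."
        )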
        description="Dataset of prompts that tend to result in refusals (used for evaluating model performance).",
    )

    show_all_trials: bool = Field(
What is the reason for this change? Were there too many trials on the Pareto front?
(responding to comments as shown top to bottom on GitHub mobile, this is the first one)
This is intended as a fallback for a filter I added on the KL divergence of the results that can be returned (IIRC applied before the results are presented as the "here's the KL div vs. refusals Pareto-optimal choices" list). Some details:
- this has to do with strange phenomena (described below) that I was encountering with the LFM2 models, not any defaults
- I added a filter to exclude trials with KL divergence > 2.0 prior to any consideration of what appears in the result list/table
- the --show_all_trials flag mainly exists to revert that filtering if desired and return the original, unfiltered list
on strange conv optimization
Models with conv layers, specifically LFM2, are hard to optimize past a certain 'minimum threshold' of refusals. Below, say, 30-40 refusals/100, the KL divergence spikes dramatically and the KL divergence/refusals tradeoff becomes much worse:
- for trials that achieve < 30 refusals on LFM2-2.6B (and the next-smallest model), the KL divergence climbs to 2, 5, sometimes 10+ as you approach the < 5/100 refusals mark
- it does not seem to be a "number of trials" problem; I tried running for much longer and it didn't seem to make much, if any, progress
- the other "alternative"/hybrid layer types included in this PR (Mamba) do not exhibit this behavior and can be optimized and abliterated just fine
- my current best theory on why is that the assumption that "refusal is a single direction" is not true, or not fully true, in conv layers, and some of the model's refusal/alignment does in fact 'live there'
Long story short: with hybrid models/conv layers (LFM2) it was common to see a KL divergence of 2+ in the trial list. I filter those out by default (they damage the model too much) and added this flag in case someone doesn't want that and prefers the original behavior.
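For context, a minimal sketch of the filtering behavior described above (the helper name and exact wiring are illustrative, not Heretic's actual code); it assumes a multi-objective Optuna study whose trials carry a kl_divergence user attribute:

import optuna

KL_THRESHOLD = 2.0  # trials above this are considered too damaged to present

def select_presentable_trials(
    study: optuna.Study, show_all_trials: bool
) -> list[optuna.trial.FrozenTrial]:
    # Pareto-optimal trials (refusals vs. KL divergence).
    trials = list(study.best_trials)
    if show_all_trials:
        # --show_all_trials: skip filtering and show the full Pareto front.
        return trials
    # Default: hide trials whose KL divergence indicates too much damage.
    return [t for t in trials if t.user_attrs["kl_divergence"] <= KL_THRESHOLD]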
| f"{component}.min_weight", | ||
| 0.0, | ||
| 1.0, | ||
| min_weight_fraction_range[0], |
👍 min_weight_fraction is a better name than min_weight here; we should probably change the variable name itself as well.
key=lambda trial: trial.user_attrs["refusals"],
key=lambda trial: (
    trial.user_attrs["refusals"],
    max(trial.user_attrs["kl_divergence"], 0),
KL divergences cannot be negative by construction. Did you encounter negative values? If so, that would indicate a serious bug in KLD computation.
Agreed. This can go; IIRC it was a safeguard/overkill AI suggestion that I kept in.
    max_weight_range = (0.8, 2.0)
    min_weight_fraction_range = (0.0, 0.9)
else:
    max_weight_range = (0.8, 2.5)
Doesn't this case include the MLP projections? Based on my tests, 2.5 is much higher than needed. Did you see different behavior?
I wonder if those ranges should be moved into get_abliterable_components, to keep this code component-agnostic. Better yet, they should probably be in a map in the configuration.
I'll give you more info on this later (flight about to take off), but the 2.5/much larger range is intended to help with hybrid layers only (or at least that is my intention), and specifically with the weird/different behavior of conv layers in hybrid models (detailed in the long comment on show_all_trials).
| f"KL divergence: {trial.user_attrs['kl_divergence']:.2f}" | ||
| ), | ||
| value=trial, | ||
| kl_threshold = 2.0 |
def try_add(component: str, matrix: Any):
    if matrix is None:
        return

    assert torch.is_tensor(matrix)
    if not torch.is_tensor(matrix):
        return
Why did you change this? We suppress exceptions everywhere now, right?
with suppress(Exception):
    try_add("attn.o_proj", layer.self_attn.o_proj.weight)
with suppress(Exception):
    try_add("attn.o_proj", layer.self_attn.out_proj.weight)
LFM2/LFM2-VL (one of them) does this
# Granite MoE Hybrid - attention layers with shared_mlp.
with suppress(Exception):
    try_add("mlp.down_proj", layer.shared_mlp.output_linear.weight)
    try_add("mlp.shared_down_proj", layer.shared_mlp.output_linear.weight)
Why the name distinction? I aliased multiple naming conventions together to simplify configuration and optimization. The assumption was that semantically related components would react similarly to ablation. Did you observe something different?
| try_add("mlp.down_proj", expert.down_proj.weight) | ||
| with suppress(Exception): | ||
| for expert in layer.feed_forward.experts: | ||
| try_add("mlp.down_proj", expert.w2.weight) |
Which models use down_proj and which ones use w2?
Fair enough! Let me take this back and rework it. I appreciate all the comments; it might take me a few days to get to reworking this and addressing them, but I will.
(Finally getting a chance to respond to some comments etc. Doing that on here for the sake of centralization/continuity, but I will split the work/PR itself into 3+ PRs as suggested.)
Sounds great! Just a heads up, there is now #60, which touches a lot of the same code as this PR, so this might get a bit messy. I'll merge whichever change is ready first, and then the other one will have to be adapted accordingly.
No problem! Due to time constraints/just getting back from holiday travel, I'm happy to let #60 get merged and play clean-up crew. I'll target this weekend-ish for the first 1-2 PRs you discussed (those that are still relevant after that one).
Thanks, I really appreciate it! Looking forward to seeing your hybrid/Mamba support land on top of this. :)
…thod
- Add explanatory comments for warning suppression and gc behavior
- Remove redundant gc.collect() calls (empty_cache handles it)
- Fix output message order (ask merge strategy before 'Uploading...')
- Add comment explaining 8-bit quantization doesn't need compute_dtype
- Remove extra newline after dtype comment
- Add future-proofing note for hybrid layer support (p-e-w#43)
- Remove leftover comment in get_merged_model
- Delete test_lora.py (debug script, not a real test)
- Add comment explaining needs_reload flag purpose
- Extract quantization config into _get_quantization_config() helper
- Rename reload_model() to reset_model_for_trial() for clarity
- Fix reload_model to respect quantization config (fixes evaluate_model bug)
- Remove unused gc import
* Add files via upload
* perf: optimize abliteration matrix op (#46)
* perf: optimize abliteration matrix op
* refactor: comments and var names correspond with arditi
* refactor: fix comments and improve var notation
* fix: accidental line change and improve comments
---------
Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>
* Fix line endings to LF
* Add hybrid approach for GPT-OSS compatibility
  - Check for LoRA adapters before attempting LoRA abliteration
  - Fall back to direct weight modification for nn.Parameter (GPT-OSS)
  - Ensures compatibility across all model architectures
* Fix projector bug, update print statement, revert README
* Revert README changes to match upstream
* Fix import sorting for ruff
* Fix reload_model for evaluate_model, add type hints and validation
* Apply ruff formatting
* Replace load_in_4bit with quantization enum
* Fix precision loss: use FP32 refusal direction directly
* Move r assignment into non-LoRA path
* Fix linting: apply ruff formatting
* Add auto-merge for LoRA adapters on save/upload
* Fix linting: apply ruff formatting
* Implement CPU-based merge for 4-bit models with OOM fallback
* Remove use_lora flag (LoRA always on), add user prompt for 4-bit export
* Fix: PEFT target_modules expects module names without path prefix
* Fix linting: apply ruff formatting
* Add LoRA fallback and fix quantization_config handling
  - Add try/except around LoRA initialization with fallback to direct weight modification
  - Only pass quantization_config when not None (fixes gpt-oss loading)
  - Use simple forward pass instead of generate() for model test (avoids chat template issues)
  - Reset non-LoRA models by reloading in reload_model()
  - Check self.use_lora before accessing LoRA adapters in abliterate()
* Add 8-bit quantization support via bitsandbytes
  - Add BNB_8BIT option to QuantizationMethod enum
  - Add --load-in-8bit CLI support (auto via pydantic-settings)
  - Update documentation in config.py and config.default.toml
  - Useful for mid-range VRAM (12-16 GB) as balance between memory and numeric stability
* Improve LoRA merge warning and fix linting
* Apply final ruff formatting
* Fix CI: apply ruff import sorting
* Use tiny model for CI efficiency
* Fix import sorting in test_lora.py
* Fix formatting in test_lora.py
* feat: Show merge warning for all models (requires high RAM)
* style: Apply ruff fixes
* Fix undefined Style import in main.py
* Fix(model): Support MoE/3D tensors and enforce dtype safety in abliterate
* Fix(ci): Format model.py with ruff
* Fix(main): Remove invalid style argument from prompt_select and unused import
* Fix logic errors, memory leak, and redundant merges in main.py
* Fix linting and formatting issues (isort, ruff)
* chore: Simplify .gitattributes as requested
* refactor: Remove defensive try-except around LoRA initialization
* chore: Update uv.lock with peft and bitsandbytes
* chore: Regenerate uv.lock to include missing peft dependency
* style: Fix import sorting (isort) for CI compliance
* style: Simplify .gitattributes to single line as requested
* Address PR #60 feedback: Remove caching, fix LoRA reload, global LoRA usage, style fixes
* Address PR review comments: clarify code, fix quantization, rename method
  - Add explanatory comments for warning suppression and gc behavior
  - Remove redundant gc.collect() calls (empty_cache handles it)
  - Fix output message order (ask merge strategy before 'Uploading...')
  - Add comment explaining 8-bit quantization doesn't need compute_dtype
  - Remove extra newline after dtype comment
  - Add future-proofing note for hybrid layer support (#43)
  - Remove leftover comment in get_merged_model
  - Delete test_lora.py (debug script, not a real test)
  - Add comment explaining needs_reload flag purpose
  - Extract quantization config into _get_quantization_config() helper
  - Rename reload_model() to reset_model_for_trial() for clarity
  - Fix reload_model to respect quantization config (fixes evaluate_model bug)
  - Remove unused gc import
* Restore gc.collect() before empty_cache() for large models
* refactor: Remove LoRA fallback remnants, simplify code
  - Remove use_lora flag (always true since LoRA is always applied)
  - Remove isinstance(PeftModel) check in get_merged_model() (always true)
  - Simplify reset_model_for_trial() by removing defensive try/except
  - Remove redundant gc.collect() calls (empty_cache handles GC)
  - Remove unused gc import from main.py
* Address p-e-w review feedback: rename reset_model, remove loaded_model_name, fix type hints, remove GPT-OSS MoE, update assertion
* Restore skip logic for non-LoRA modules and fix 4-bit base_layer.weight access
* Remove defensive lora_A check per review - get_layer_modules already filters
* Fix try_add: nest component init inside Module check, add assert for unexpected types
* Add note about module.weight assumption for type checking
* Change 'Reloading model' to 'Resetting model' in logging
---------
Co-authored-by: accemlcc <accemlcc@users.noreply.github.com>
Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>
Co-authored-by: Hager <Michael.Hager@bruker.com>
Great! I did start taking a look; in the process of that (incidentally testing out v1.1.0) I realized that there is a (hopefully minor) consistency issue and opened #88, which IMO should be handled first (otherwise I can't reliably tell you whether parameter ranges need to be expanded for conv-hybrid models like LFM2, and so on). Other quick comments:
Sounds perfect!
/gemini summary
Summary of Changes

This pull request significantly extends the model compatibility of the system by introducing robust support for hybrid layer architectures, including those with convolution and Mamba components. It refines the process of identifying abliterable components across various complex model structures and dynamically adjusts optimization parameters to suit these new layer types. Furthermore, it enhances the user experience by providing more insightful trial selection options, including filtering based on KL divergence, and ensures smoother model saving and uploading through automatic generation configuration sanitization.
/gemini review
Code Review
This pull request significantly extends the tool's capabilities by adding support for hybrid models with mamba and convolution layers, such as Granite and LFM. The changes are well-structured and include improved component detection, tuned hyperparameter search ranges, and better user experience in trial selection. I've provided a few suggestions to enhance code maintainability and fix a potential bug in how model components are detected within lists of experts. Overall, this is a great contribution that makes the tool more powerful and robust.
with suppress(Exception):
    for expert in layer.moe.experts:
        try_add("mlp.down_proj", expert.output_linear.weight)
The with suppress(Exception) block is outside the for loop. If an exception occurs for one of the experts (e.g., it has an unexpected structure), the entire loop will be aborted, and subsequent experts won't be processed. This could lead to an incomplete list of abliterable components.
To make this more robust, the suppress block should be moved inside the loop to handle exceptions on a per-expert basis. This pattern appears in other parts of this method when iterating over experts, and those should be updated similarly for robustness.
Current:
with suppress(Exception):
    for expert in layer.moe.experts:
        try_add("mlp.down_proj", expert.output_linear.weight)

Suggested:
with suppress(Exception):
    for expert in layer.moe.experts:
        with suppress(Exception):
            try_add("mlp.down_proj", expert.output_linear.weight)
for expert in layer.feed_forward.experts:
    try_add("mlp.down_proj", expert.down_proj.weight)
with suppress(Exception):
    for expert in layer.feed_forward.experts:
        try_add("mlp.down_proj", expert.w2.weight)
These two loops are inefficient and have the same issue I mentioned in another comment: the suppress(Exception) is outside the loop, making it not robust to failures with individual experts.
You can combine these into a single, more robust loop. This avoids iterating over the experts twice and ensures that if one attribute access fails for an expert, you still attempt to access the other.
Current:
for expert in layer.feed_forward.experts:
    try_add("mlp.down_proj", expert.down_proj.weight)
with suppress(Exception):
    for expert in layer.feed_forward.experts:
        try_add("mlp.down_proj", expert.w2.weight)

Suggested:
with suppress(Exception):
    for expert in layer.feed_forward.experts:
        with suppress(Exception):
            try_add("mlp.down_proj", expert.down_proj.weight)
        with suppress(Exception):
            try_add("mlp.down_proj", expert.w2.weight)
| if "conv" in component or "mamba" in component: | ||
| max_weight_range = (0.8, 2.5) | ||
| min_weight_fraction_range = (0.0, 0.8) | ||
| elif "attn" in component: | ||
| max_weight_range = (0.8, 1.8) | ||
| min_weight_fraction_range = (0.0, 1.0) | ||
| elif "shared" in component: | ||
| max_weight_range = (0.8, 2.0) | ||
| min_weight_fraction_range = (0.0, 0.9) | ||
| else: | ||
| max_weight_range = (0.8, 2.5) | ||
| min_weight_fraction_range = (0.0, 0.9) |
This if/elif/else block for setting parameter ranges contains some redundancy. You can make it more concise and maintainable by setting default values first and then overriding them only where they differ in the conditional branches. This makes the logic for the default case more explicit and reduces repeated code.
| if "conv" in component or "mamba" in component: | |
| max_weight_range = (0.8, 2.5) | |
| min_weight_fraction_range = (0.0, 0.8) | |
| elif "attn" in component: | |
| max_weight_range = (0.8, 1.8) | |
| min_weight_fraction_range = (0.0, 1.0) | |
| elif "shared" in component: | |
| max_weight_range = (0.8, 2.0) | |
| min_weight_fraction_range = (0.0, 0.9) | |
| else: | |
| max_weight_range = (0.8, 2.5) | |
| min_weight_fraction_range = (0.0, 0.9) | |
| # Default ranges, also used for MLP components | |
| max_weight_range = (0.8, 2.5) | |
| min_weight_fraction_range = (0.0, 0.9) | |
| if "conv" in component or "mamba" in component: | |
| min_weight_fraction_range = (0.0, 0.8) | |
| elif "attn" in component: | |
| max_weight_range = (0.8, 1.8) | |
| min_weight_fraction_range = (0.0, 1.0) | |
| elif "shared" in component: | |
| max_weight_range = (0.8, 2.0) |
sampling_fields = []
if getattr(gen_config, "temperature", None) not in (None, 1.0):
    sampling_fields.append("temperature")
if getattr(gen_config, "top_k", None) not in (None, 0):
    sampling_fields.append("top_k")
if getattr(gen_config, "top_p", None) not in (None, 1.0):
    sampling_fields.append("top_p")
if getattr(gen_config, "typical_p", None) not in (None, 1.0):
    sampling_fields.append("typical_p")

if sampling_fields and not bool(getattr(gen_config, "do_sample", False)):
    gen_config.do_sample = True
The logic to check for sampling parameters can be made more concise. Instead of building a list of fields just to check if it's non-empty, you can use a boolean expression directly. This improves readability and is slightly more efficient.
Current:
sampling_fields = []
if getattr(gen_config, "temperature", None) not in (None, 1.0):
    sampling_fields.append("temperature")
if getattr(gen_config, "top_k", None) not in (None, 0):
    sampling_fields.append("top_k")
if getattr(gen_config, "top_p", None) not in (None, 1.0):
    sampling_fields.append("top_p")
if getattr(gen_config, "typical_p", None) not in (None, 1.0):
    sampling_fields.append("typical_p")
if sampling_fields and not bool(getattr(gen_config, "do_sample", False)):
    gen_config.do_sample = True

Suggested:
has_sampling_params = (
    getattr(gen_config, "temperature", 1.0) not in (None, 1.0)
    or getattr(gen_config, "top_k", 0) not in (None, 0)
    or getattr(gen_config, "top_p", 1.0) not in (None, 1.0)
    or getattr(gen_config, "typical_p", 1.0) not in (None, 1.0)
)
if has_sampling_params and not getattr(gen_config, "do_sample", False):
    gen_config.do_sample = True
This PR adds support for new layer types (Mamba and convolution layers), as well as for hybrid models whose layer types vary throughout the model. This enables use with models like granite-4.0, LFM2, etc.
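For readers unfamiliar with the approach, a heavily hedged sketch of the per-layer component discovery pattern this PR extends (the conv/mamba attribute paths below are illustrative placeholders, not the exact module names used by Granite or LFM2; only the attention paths appear in the quoted diff):

from contextlib import suppress

import torch

def get_layer_components(layer) -> dict[str, torch.Tensor]:
    """Sketch: discover abliterable weight matrices in a (possibly hybrid) layer."""
    components: dict[str, torch.Tensor] = {}

    def try_add(component: str, matrix) -> None:
        if matrix is None or not torch.is_tensor(matrix):
            return
        components[component] = matrix

    # Attention layers (o_proj vs. out_proj naming varies across architectures).
    with suppress(Exception):
        try_add("attn.o_proj", layer.self_attn.o_proj.weight)
    with suppress(Exception):
        try_add("attn.o_proj", layer.self_attn.out_proj.weight)
    # Hypothetical conv / mamba projections in hybrid layers (names assumed).
    with suppress(Exception):
        try_add("conv.out_proj", layer.conv.out_proj.weight)
    with suppress(Exception):
        try_add("mamba.out_proj", layer.mamba.out_proj.weight)
    return components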