feat: avoid excessive low divergence iteration #73

Merged

p-e-w merged 3 commits into p-e-w:master from spikymoth:cleanup-objective on Dec 14, 2025
Conversation

@spikymoth (Contributor) commented Dec 5, 2025

Technically 2 features and some cleanup:

  1. Make direction scope configurable
  2. Adjust objective to discourage iteration near 0 KL divergence
  3. Enable grouping in sampler
  4. Clean up the objective function

The 1st change is self-explanatory, although compared to the commit I pushed to #52 I simplified it a little by not using an enum. It's a bit more error-prone this way because the same strings are repeated multiple times, but the enum also makes things more verbose.

The 2nd change implements your suggestion from #52 (comment), but I then added a utility function that creates a smooth transition using a sigmoid function. Without this smoothing, best_trials seemed to get a bit confused and was less effective at avoiding low divergence values. With this change, I see a lot more trials that get the refusals score down, though mostly when turning up n_trials or cheating by manually narrowing the parameter range of direction_index.
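
The smoothing idea can be sketched like this. The function name, sharpness constant, and blending formula below are illustrative assumptions, not the PR's actual code; the point is that a sigmoid replaces a hard cutoff, so the optimizer sees a gradual shift from the KL term to the refusal count near the threshold:

```python
import math

def smooth_step(x: float, threshold: float, sharpness: float = 50.0) -> float:
    """Sigmoid rising from 0 to 1 as x crosses threshold (hypothetical helper)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))

def score(kl_divergence: float, refusals: int, kl_min: float = 0.01) -> float:
    # Near and below kl_min, weight shifts smoothly from the KL term toward
    # the refusal count, discouraging iteration at meaninglessly low divergence.
    w = smooth_step(kl_divergence, kl_min)
    return w * kl_divergence + (1.0 - w) * refusals
```

Without the smoothing, a trial sitting just below the threshold would see a discontinuous jump in its score, which is presumably what confused the sampler.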

The 3rd and 4th changes are together in 1 commit. By enabling grouping, TPESampler will automatically divide the search space across categorical parameters. Since direction_index is only used for direction_scope == "global", that's exactly what we need when both direction scopes are enabled. It also means we don't have to set direction_index unconditionally (although I think that even if you don't set group=True, it still works, it just doesn't divide the search space).

The other changes are mostly cleanup: I introduced variables num_layers and last_layer_index so they aren't computed each time, and we can use one or the other depending on the application - last_layer_index when we want to generate an index, num_layers when we want some fraction of the total size.

I experimented a lot with different parameterizations, like turning max_weight_position into an offset from direction_index or splitting min_weight and min_weight_distance up into forward and backward parts (making the weight asymmetric). I think the latter has some merit, but I'm not sure it's worth the cost of 2 extra parameters to optimize. Either way it was too speculative for this PR, so I limited the objective function changes to just cleanup.

I've tested this independently from #52. I expect the change in scoring to help even more there, but I think it's a good change regardless.

@p-e-w (Owner) left a comment:

Thanks for the PR!

@spikymoth force-pushed the cleanup-objective branch 2 times, most recently from 0d0aeaf to 2b129fe on December 9, 2025
@spikymoth (Contributor, Author) commented:

The patch stack was getting very messy, so I rebased and merged the commits into logical units.

Scoring now uses a hard transition from KL divergence to (scaled) refusal count. I renamed the setting to kl_divergence_min and updated the comment.
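
A minimal sketch of the hard transition, assuming a threshold default of 0.01 and a simple linear refusal scaling (the PR's exact formula may differ):

```python
KL_DIVERGENCE_MIN = 0.01  # assumed default for the kl_divergence_min setting

def optimization_score(kl_divergence: float, refusals: int,
                       refusal_scale: float = 1.0) -> float:
    """Below the threshold, the score becomes the (scaled) refusal count,
    so pushing divergence further toward zero no longer improves it."""
    if kl_divergence < KL_DIVERGENCE_MIN:
        return refusal_scale * refusals
    return kl_divergence
```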

Before showing the trial results, we now filter out any trials from best_trials that have a worse divergence with the same refusal count.

@p-e-w (Owner) commented Dec 11, 2025

I thought about this some more, and we actually can't use study.best_trials at all with this approach.

The problem is that best_trials contains the Pareto front of the study – but that Pareto front is based on an objective that doesn't match our actual preferences. Note that we still want to rank trials by (kld, refusals) for the purpose of picking the best one, we just modify the optimization score to encourage convergence towards lower refusals.

As currently implemented, it's possible for best_trials to exclude trials that are on the Pareto front we care about, because it is based on the Pareto front the optimizer sees.

@spikymoth (Contributor, Author) commented:

Alright, it looks like Pareto front calculation in Optuna (and maybe generally) is actually really simple. These are the relevant parts for our purposes:

Here loss_values are just the scores, since we're minimizing.

So it does the following:

  1. Get the unique, lexicographically sorted values (in our case, sorted by KL divergence score, then by refusal score)
  2. Get the cumulative minima for the 2nd objective
  3. Mask off trials that don't improve the minimum

I implemented the same thing without numpy, which makes it even simpler. Performance should be fine for the number of trials we're dealing with here.
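
The three steps above translate to a short pure-Python function. This is a sketch of the approach described, not the PR's actual code; `points` are `(kld_score, refusal_score)` pairs, both minimized:

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Pareto front for two minimized objectives: sort unique points
    lexicographically, then keep each point that strictly improves the
    running minimum of the second objective."""
    front = []
    best_second = float("inf")
    for first, second in sorted(set(points)):
        if second < best_second:
            front.append((first, second))
            best_second = second
    return front
```

Sorting dominates the cost at O(n log n), which is negligible for the trial counts involved here.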

@p-e-w (Owner) commented Dec 12, 2025

Btw, feel free to join the Discord (link in README) for more real-time communication. I can often respond there more quickly than on GitHub.

@spikymoth changed the title from "feat: make direction scope configurable, improve scoring" to "feat: avoid excessive low divergence iteration" on Dec 12, 2025
@spikymoth (Contributor, Author) commented:

> Btw, feel free to join the Discord (link in README) for more real-time communication. I can often respond there more quickly than on GitHub.

Sure, joined just now :) I don't think I'll be super active, but I'll try to keep an eye on notifications and such.

Commits:

- Adjusts the scoring function to avoid targeting meaninglessly low KL divergences. Below a threshold value, the KL divergence score switches to the refusal count. Adds config option kl_divergence_target (defaulting to 0.01).
- Create variables for num_layers and last_layer_index: improves readability and makes choices explicit.
@p-e-w p-e-w merged commit 9d17348 into p-e-w:master Dec 14, 2025
4 checks passed
@p-e-w (Owner) commented Dec 14, 2025

Merged! I appreciate your patience in seeing this through until it was correct in every way, that's super valuable.

At some point in the future, I intend to test whether the kl_divergence_scale parameter actually carries its weight. Scale does matter with multi-objective TPE because it uses hypervolumes to determine overall improvement, but in practice, the KLD is usually on the scale of 1 for the type of optimization we do, so we might not need this to be configurable after all.

@spikymoth (Contributor, Author) commented:

Yes, I also wonder about other changes to the scoring function, like applying a power law (with an exponent smaller than 1) to the refusal count, so reducing the number of refusals from 2 to 1 has more weight than a reduction from 20 to 19. But it's quite tricky to go from my little experiments to something production-ready.

@spikymoth spikymoth deleted the cleanup-objective branch January 3, 2026 02:07