feat: avoid excessive low divergence iteration #73
Conversation
(force-pushed from 0d0aeaf to 2b129fe)
The patch stack was getting very messy, so I rebased and merged the commits into logical units. Scoring now uses a hard transition from KL divergence to (scaled) refusal count. I renamed the setting to `kl_divergence_target`. Before showing the trial results, we now filter out any trials from […]
(force-pushed from 2b129fe to 0c74546)
I thought about this some more, and we actually can't use `best_trials` […]. The problem is that […]. As currently implemented, it's possible for […].
Alright, it looks like Pareto front calculation in Optuna (and maybe generally) is actually really simple. These are the relevant parts for our purposes: […]

So it does the following: it keeps every trial that is not dominated by any other trial, where trial A dominates trial B if A is at least as good on every objective and strictly better on at least one.
I implemented the same thing without numpy, which makes it even simpler. Performance should be fine for the number of trials we're dealing with here.
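For what it's worth, a minimal sketch of that numpy-free approach (my reconstruction, not the actual patch; it assumes each trial's objective values are a list of numbers, all minimized):

```python
def dominates(a: list[float], b: list[float]) -> bool:
    # a dominates b if it is at least as good on every objective
    # and strictly better on at least one (all objectives minimized).
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def pareto_front(objective_values: list[list[float]]) -> list[int]:
    # A trial is on the front iff no other trial dominates it.
    return [
        i
        for i, values in enumerate(objective_values)
        if not any(
            dominates(other, values)
            for j, other in enumerate(objective_values)
            if j != i
        )
    ]


# Example with (refusal_count, kl_divergence), both minimized:
print(pareto_front([[3, 0.02], [1, 0.08], [2, 0.05], [3, 0.05]]))  # [0, 1, 2]
```

This is quadratic in the number of trials, which is fine at this scale.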
Btw, feel free to join the Discord (link in README) for more real-time communication. I can often respond there more quickly than on GitHub.
(force-pushed from 869d70f to 67bb7b0)
Sure, joined just now :) I don't think I'll be super active, but I'll try to keep an eye on notifications and such.
Adjusts the scoring function to avoid targeting meaninglessly low KL divergences. Below a threshold value, the KL divergence score switches to the refusal count. Adds config option `kl_divergence_target` (defaulting to 0.01).
Create variables for `num_layers` and `last_layer_index`
* Improves readability and makes choices explicit
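For illustration, a hypothetical sketch of the hard transition described in the commit message above (not the actual patch; the scaling and the `total_prompts` parameter are made up):

```python
KL_DIVERGENCE_TARGET = 0.01  # the new config option's default


def score(kl_divergence: float, refusal_count: int, total_prompts: int) -> float:
    # Hypothetical sketch: below the target, lowering KL divergence further
    # is meaningless, so the (scaled) refusal count takes over as the
    # quantity being minimized; above it, KL divergence dominates.
    if kl_divergence < KL_DIVERGENCE_TARGET:
        return refusal_count / total_prompts  # scaled refusal count, in [0, 1]
    return 1.0 + kl_divergence  # always worse than any below-target trial
```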
(force-pushed from 67bb7b0 to 9cd98b1)
Merged! I appreciate your patience in seeing this through until it was correct in every way; that's super valuable. At some point in the future, I intend to test whether the […]
Yes, I also wonder about other changes to the scoring function, like applying a power law (with an exponent smaller than 1) to the refusal count, so that reducing the number of refusals from 2 to 1 has more weight than a reduction from 20 to 19. But it's quite tricky to go from my little experiments to something production-ready.
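As a quick numerical illustration of that power-law idea (exponent 0.5 chosen arbitrarily):

```python
p = 0.5  # any exponent < 1 compresses differences at high counts
print(2**p - 1**p)    # ~0.414: dropping from 2 to 1 refusals
print(20**p - 19**p)  # ~0.113: dropping from 20 to 19 refusals
```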
Technically 2 features and some cleanup: […]
The 1st change is self-explanatory, although compared to the commit I pushed to #52, I simplified it a little by not using an enum. It's a bit more error-prone this way because the same strings are repeated multiple times, but the enum also makes things more verbose.
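For context, the tradeoff looks roughly like this (illustrative only; the scope values mirror `direction_scope` as used below, and the second value is my assumption):

```python
from enum import Enum


# Plain strings (this PR): terse, but "global" is repeated at every
# use site, so a typo silently never matches.
def is_global(direction_scope: str) -> bool:
    return direction_scope == "global"


# Enum variant (as in #52): one definition catches typos, at the cost
# of more verbosity at every comparison.
class DirectionScope(Enum):
    GLOBAL = "global"
    PER_LAYER = "per layer"  # assumed second value


def is_global_enum(direction_scope: DirectionScope) -> bool:
    return direction_scope is DirectionScope.GLOBAL
```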
The 2nd change implements your suggestion from #52 (comment), but I then added a utility function that creates a smooth transition using a sigmoid function. Without this smoothing, `best_trials` seemed to get a bit confused and it wasn't as effective at avoiding low divergence values. With this change, I see a lot more trials that get the refusals score down, though mostly when turning up `n_trials` or cheating by manually narrowing the parameter range of `direction_index`.
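Roughly what I mean by the smooth transition (a simplified sketch; the utility function in the actual commit may differ):

```python
import math


def smooth_step(x: float, threshold: float, sharpness: float = 1000.0) -> float:
    # Logistic blend factor: ~0 well below the threshold, ~1 well above it.
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))


def blended_score(kl_divergence: float, refusal_score: float,
                  kl_divergence_target: float = 0.01) -> float:
    # Instead of a hard if/else at the target, fade between the two
    # score components so the optimizer sees a continuous objective.
    t = smooth_step(kl_divergence, kl_divergence_target)
    return (1.0 - t) * refusal_score + t * (1.0 + kl_divergence)
```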
The 3rd and 4th changes are together in 1 commit. By enabling grouping, `TPESampler` will automatically divide the search space across categorical parameters. Since `direction_index` is only used for `direction_scope == "global"`, that's exactly what we need when both direction scopes are enabled. It also means we don't have to set `direction_index` unconditionally (although I think that even if you don't set `group=True`, it still works, it just doesn't divide the search space).
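In code, the grouping setup looks roughly like this (a sketch; the integer range, the second scope value, and the placeholder objective are mine):

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    direction_scope = trial.suggest_categorical(
        "direction_scope", ["global", "per layer"]
    )
    if direction_scope == "global":
        # Only suggested on this branch; with group=True the sampler
        # models the two resulting subspaces independently.
        direction_index = trial.suggest_int("direction_index", 0, 31)
    # ... remaining parameters and the actual scoring would go here
    return 0.0  # placeholder


# group=True requires multivariate=True.
sampler = optuna.samplers.TPESampler(multivariate=True, group=True)
study = optuna.create_study(direction="minimize", sampler=sampler)
study.optimize(objective, n_trials=20)
```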
The other changes are mostly cleanup. I introduced variables `num_layers` and `last_layer_index` so they aren't computed each time, and we can use one or the other depending on the application: `last_layer_index` when we want to generate an index, `num_layers` when we want some fraction of the total size.
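A small standalone example of that distinction (hypothetical numbers, not the actual objective code):

```python
num_layers = 32  # e.g. number of transformer layers
last_layer_index = num_layers - 1

# When we want to generate an index, the bound is last_layer_index:
valid_indices = range(0, last_layer_index + 1)

# When we want some fraction of the total size, num_layers is the base:
max_weight_position = 0.6 * num_layers  # 60% of the way through the stack

print(list(valid_indices)[-1], max_weight_position)  # 31 19.2
```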
I experimented a lot with different parameterizations, like turning `max_weight_position` into an offset from `direction_index`, or splitting `min_weight` and `min_weight_distance` up into forward and backward parts (making the weight asymmetric). I think the latter has some merit, but I'm not sure it's worth the cost of 2 extra parameters to optimize. Either way, it was too speculative for this PR, so I limited the objective function changes to just cleanup.

I've tested this independently from #52. I expect the change in scoring to help even more there, but I think it's a good change regardless.