In the current implementation, the Engram module appears to be trained with a 5× higher learning rate than the backbone, and with `weight_decay=0`. Could you clarify the motivation behind these choices? Specifically:
- Does the higher LR help overcome gradient attenuation due to gating or late insertion in the network?
- Is `weight_decay=0` used to avoid regularizing the discrete n-gram memory embeddings (which may harm capacity)?
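
For concreteness, here's a minimal PyTorch sketch of the per-parameter-group setup I'm describing. The module names, base LR, and decay value are placeholder assumptions on my part, not the repo's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-in: `backbone` / `engram` are hypothetical attribute names,
# and base_lr / decay values are illustrative, not the actual config.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(64, 64)
        self.engram = nn.Embedding(1000, 64)  # n-gram memory table

model = ToyModel()
base_lr = 3e-4

# Split parameters by name so the two groups get different hyperparameters.
engram_params = [p for n, p in model.named_parameters() if "engram" in n]
other_params = [p for n, p in model.named_parameters() if "engram" not in n]

optimizer = torch.optim.AdamW([
    # Backbone: base LR, standard weight decay.
    {"params": other_params, "lr": base_lr, "weight_decay": 0.1},
    # Engram: 5x LR, no weight decay -- the setting I'm asking about.
    {"params": engram_params, "lr": 5 * base_lr, "weight_decay": 0.0},
])
```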
Thanks!