In the current implementation, the Engram module appears to be trained with a 5× higher learning rate than the backbone, and with `weight_decay=0`. Could you clarify the motivation behind these choices? Specifically:
- Does the higher LR help overcome gradient attenuation due to gating or late insertion in the network?
- Is `weight_decay=0` used to avoid regularizing the discrete n-gram memory embeddings (which may harm capacity)?
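
For concreteness, here's a minimal PyTorch sketch of the per-parameter-group setup I'm describing. The module names, base LR, and decay value are placeholder assumptions on my part, not the repo's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-in: `backbone` / `engram` are hypothetical attribute names,
# and base_lr / decay values are illustrative, not the actual config.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(64, 64)
        self.engram = nn.Embedding(1000, 64)  # n-gram memory table

model = ToyModel()
base_lr = 3e-4

# Split parameters by name so the two groups get different hyperparameters.
engram_params = [p for n, p in model.named_parameters() if "engram" in n]
other_params = [p for n, p in model.named_parameters() if "engram" not in n]

optimizer = torch.optim.AdamW([
    # Backbone: base LR, standard weight decay.
    {"params": other_params, "lr": base_lr, "weight_decay": 0.1},
    # Engram: 5x LR, no weight decay -- the setting I'm asking about.
    {"params": engram_params, "lr": 5 * base_lr, "weight_decay": 0.0},
])
```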
Thanks!