Why use 5× learning rate and zero weight decay for Engram parameters? #9

@goodluckcwl

In the current implementation, the Engram module appears to be trained with a higher learning rate than the backbone and with `weight_decay=0`. Could you clarify the motivation behind these choices? Specifically:

- Does the higher LR help overcome gradient attenuation due to gating or late insertion in the network?
- Is `weight_decay=0` used to avoid regularizing the discrete n-gram memory embeddings (which might otherwise harm capacity)?

Thanks!
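
For reference, here is a minimal sketch of the kind of optimizer setup I am asking about, assuming a PyTorch-style AdamW with separate parameter groups. The module names, matching rule, and numeric values below are illustrative only, not the repo's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the model: a backbone layer plus an "engram"
# embedding table acting as the n-gram memory. Names are illustrative.
model = nn.ModuleDict({
    "backbone": nn.Linear(512, 512),
    "engram": nn.Embedding(10000, 512),
})

base_lr = 3e-4  # illustrative base learning rate

# Split parameters by name: Engram memory vs. everything else.
engram_params = [p for n, p in model.named_parameters() if n.startswith("engram")]
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("engram")]

optimizer = torch.optim.AdamW([
    # Backbone: base LR with standard weight decay.
    {"params": backbone_params, "lr": base_lr, "weight_decay": 0.1},
    # Engram parameters: 5x LR and no weight decay, as described above.
    {"params": engram_params, "lr": 5 * base_lr, "weight_decay": 0.0},
])
```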
