
The difference between Keras SGD and tf.train.MomentumOptimizer #7

@djshen

Description

There was a discussion about whether we can replace tf.compat.v1.train.MomentumOptimizer with tf.keras.optimizers.SGD. To find the difference, I read the documentation and traced the source code.

The update rule of SGD is:
v_{t+1} ← α v_t - η ∇f(θ_t)
θ_{t+1} ← θ_t + v_{t+1}

And that of MomentumOptimizer is:
v_{t+1} ← α v_t - ∇f(θ_t)
θ_{t+1} ← θ_t + η v_{t+1}

The difference is where the learning rate enters: SGD multiplies it into the velocity update (the first step), while MomentumOptimizer multiplies it into the parameter update (the second step).

If the learning rate η is constant, the two formulas are mathematically equivalent: by induction, the SGD velocity is exactly η times the MomentumOptimizer accumulator at every step, so the parameter updates coincide up to floating-point rounding. However, if the learning rate changes over time, the results differ: MomentumOptimizer rescales the entire accumulated velocity by the new η at once, whereas SGD bakes each step's η into the velocity, so a change only affects subsequent gradient contributions. Hope this answers the question.
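
To make this concrete, here is a minimal, self-contained sketch of both update rules in plain Python (the function names and the quadratic objective are just for illustration, not taken from either library):

```python
def sgd_step(theta, v, grad, lr, momentum):
    # Keras SGD: the learning rate scales the gradient inside the velocity update.
    v = momentum * v - lr * grad
    return theta + v, v

def momentum_opt_step(theta, v, grad, lr, momentum):
    # MomentumOptimizer: the velocity accumulates raw gradients;
    # the learning rate scales the velocity only when applying the update.
    v = momentum * v - grad
    return theta + lr * v, v

def grad_fn(theta):
    # Gradient of f(theta) = theta**2.
    return 2.0 * theta

momentum = 0.9

# Constant learning rate: the two trajectories coincide (up to rounding).
t1 = t2 = 1.0
v1 = v2 = 0.0
for _ in range(10):
    t1, v1 = sgd_step(t1, v1, grad_fn(t1), 0.1, momentum)
    t2, v2 = momentum_opt_step(t2, v2, grad_fn(t2), 0.1, momentum)
print(t1, t2)  # essentially identical

# Decaying learning rate: the trajectories diverge.
t1 = t2 = 1.0
v1 = v2 = 0.0
for step in range(10):
    lr = 0.1 / (1 + step)
    t1, v1 = sgd_step(t1, v1, grad_fn(t1), lr, momentum)
    t2, v2 = momentum_opt_step(t2, v2, grad_fn(t2), lr, momentum)
print(t1, t2)  # noticeably different
```

With the constant rate, v1 equals lr times v2 at every step, which is why the parameter trajectories match; once lr changes between steps, that proportionality breaks and the two optimizers take different paths.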

BTW, I found that the update rule in the slides matches SGD's rather than MomentumOptimizer's.
