Currently, the learning rate decay is applied after each iteration, and the update rule is:
lr = config.lr/(1 + args.lr_decay*step)
As a result, steps 0 and 1 both use the same learning rate, config.lr, since the decay term is zero when step is 0.
Is this the expected behavior? Or is the following correct instead:
lr = config.lr/(1 + args.lr_decay*(step+1))
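To make the off-by-one visible, here is a minimal sketch simulating both rules; base_lr and decay are hypothetical stand-ins for config.lr and args.lr_decay, and the values are illustrative, not taken from the repo:

```python
# Simulate both update rules; the decay is applied after each iteration,
# as described above. Values are illustrative placeholders.
base_lr = 0.1   # stands in for config.lr
decay = 0.1     # stands in for args.lr_decay

for offset, label in ((0, "current rule: step"), (1, "proposed rule: step + 1")):
    lr = base_lr  # learning rate in effect before any decay
    print(label)
    for step in range(4):
        print(f"  step {step}: lr = {lr:.6f}")  # lr used for this iteration
        lr = base_lr / (1 + decay * (step + offset))  # decay after the iteration
```

With the current rule, steps 0 and 1 both print 0.100000 because the decay factor is 1 at step 0; with step + 1, the learning rate already shrinks after the first iteration.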