Description
Currently, we have:
neural-fortran/src/nf/nf_optimizers.f90, lines 188 to 190 in 6adc1c2:

```fortran
param = param &
  - self % learning_rate * m_hat / (sqrt(v_hat) + self % epsilon) &
  - self % weight_decay_decoupled * param
```
However, I'm looking at the paper and PyTorch docs again.
In the paper, in Algorithm 2, line 12, the decoupled weight decay term (our `self % weight_decay_decoupled * param`) is multiplied by the schedule multiplier, but not by the learning rate.
In the PyTorch docs for AdamW, on the other hand, the weight decay term is multiplied by the learning rate.
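For reference, here is how I read the two update rules side by side (notation follows each source; $\eta_t$ is the schedule multiplier, $\alpha$ and $\gamma$ the learning rate, $\lambda$ the weight decay):

```math
\text{paper (Alg. 2, line 12):}\quad \theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda\, \theta_{t-1} \right)
```

```math
\text{PyTorch AdamW docs:}\quad \theta_t \leftarrow \theta_{t-1} - \gamma\, \lambda\, \theta_{t-1} \quad \text{(decay step, followed by the usual Adam step)}
```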
I looked at the Keras source and I can't even see where, or whether, the weight decay is used (??).
@Spnetic-5 do you also see the same discrepancy between the paper and the PyTorch docs that I do?
If yes, I suggest that we multiply it by the learning rate in our code as well. I trust that PyTorch implements it correctly more than I trust our interpretation of the paper (and papers have typos, of course).
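A minimal sketch of what I mean, keeping the existing variable names from nf_optimizers.f90 (untested, and ignoring a schedule multiplier since we don't have one):

```fortran
! Decoupled weight decay scaled by the learning rate, matching the PyTorch AdamW docs.
! Sketch only, not a tested patch.
param = param &
  - self % learning_rate * m_hat / (sqrt(v_hat) + self % epsilon) &
  - self % learning_rate * self % weight_decay_decoupled * param
```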