Nowadays, most LLMs get trained with the AdamW optimizer as opposed to the Adam optimizer. Why?
There used to be a time when Adam was the undisputed king among optimizers, and it didn't seem worth spending much effort looking for a better one. That has changed, and AdamW has become the default optimizer for LLM practitioners.
It all comes down to how the regularization term is applied to the weight parameters. For typical gradient descent, if we want to apply an L2 regularization term, we modify the loss function such that:
regularized loss = loss + (λ/2) * ||w||²
Then, we compute the gradient of that new loss to update the model parameters. The gradient of the L2 term is simply λw, so each update subtracts a small multiple of the weights themselves: the regularization keeps the weights from growing too large and acts exactly like a weight decay mechanism.
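Here is a minimal sketch of that idea, plain SGD with an L2 penalty folded into the gradient. The function name and the hyperparameter values (lr, lam) are illustrative, not from any particular library:

```python
import numpy as np

def sgd_l2_step(w, grad_loss, lr=1e-2, lam=1e-2):
    """One SGD step on: loss + (lam/2) * ||w||^2."""
    grad = grad_loss + lam * w   # gradient of the regularized loss
    return w - lr * grad         # == (1 - lr*lam) * w - lr * grad_loss,
                                 # i.e. the weights literally decay each step
```

Expanding the update shows the equivalence: the new weights are (1 - lr*lam) * w minus the usual gradient step, so for plain gradient descent, L2 regularization and weight decay are the same thing.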
In Adam, when we apply L2 regularization, we also regularize the loss function, but the term gets used differently. The gradient of the regularized loss is what feeds the first and second moments, so when we update the weights, the regularization term ends up in both the numerator and the denominator of the adaptive update. Because of this, the effect of the L2 term gets scaled down, especially for parameters with large historical gradients, and it no longer acts as a clean weight decay mechanism.
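A sketch of a single Adam step with the L2 term folded into the gradient makes this visible. The state variables (m, v, t with t starting at 1) and the hyperparameter values are illustrative defaults, not a reference implementation:

```python
import numpy as np

def adam_l2_step(w, grad_loss, m, v, t, lr=1e-3, lam=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_loss + lam * w              # L2 term enters the gradient...
    m = beta1 * m + (1 - beta1) * g      # ...so it is mixed into the 1st moment
    v = beta2 * v + (1 - beta2) * g * g  # ...and into the 2nd moment
    m_hat = m / (1 - beta1 ** t)         # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    # The decay signal lam*w is divided by sqrt(v_hat), so for parameters with
    # large gradient history the decay is scaled way down.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```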
In AdamW, on the other hand, we DO NOT regularize the loss function, and the gradient update is computed independently of the regularization term. Only during the weight update do we add the decay term, so it acts directly on the weights rather than through the loss. Because of this decoupling, training with AdamW tends to be more stable and leads to models that generalize better! Good to know, right?
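For contrast, here is the same kind of sketch for an AdamW step, where the decay term never touches the moments and is applied directly to the weights (again, names and default values are illustrative):

```python
import numpy as np

def adamw_step(w, grad_loss, m, v, t, lr=1e-3, wd=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_loss                        # gradient of the *unregularized* loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive gradient step
    w = w - lr * wd * w                          # decoupled weight decay
    return w, m, v
```

This decoupled behavior is what torch.optim.AdamW implements through its weight_decay argument, whereas the weight_decay argument of torch.optim.Adam adds the classic L2 term to the gradient instead.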