LoRA adapters are, to me, one of the smartest strategies used in Machine Learning in recent years! It is one of those ideas where I think, "Wait! Why didn't we think of that before?"
LoRA adapters came about as a very natural strategy for fine-tuning models. The key observation is that any weight matrix in a trained neural network is just the sum of its initial values and the gradient descent updates accumulated over the training mini-batches:
𝜃(trained model) = 𝜃(initial value) + gradient descent updates
From there, we can understand a fine-tuned model as the same set of parameters where we kept accumulating gradient updates, this time on some specialized dataset:
𝜃(fine-tuned) = 𝜃(trained model) + more gradient descent updates
Once we realize that we can separate what was learned during pretraining from what is learned during fine-tuning into those 2 terms:
𝜃(fine-tuned) = 𝜃(trained model) + ΔW
then we see that the 2 terms don't need to live in the same matrix; we can sum the outputs of 2 different matrices instead. That is the idea behind LoRA: we allocate new weight parameters that specialize in learning the fine-tuning data, and we freeze the original weights.
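To make that concrete, here is a minimal PyTorch-style sketch of the idea (the DeltaLinear class and its variable names are mine, just for illustration, not from any particular library): the pretrained weight is frozen, a separate update matrix is trained, and the two outputs are summed.

```python
import torch
import torch.nn as nn

class DeltaLinear(nn.Module):
    """Frozen pretrained layer W0 plus a separate, trainable update matrix delta_W.
    The two outputs are summed, so W0 is never modified in place."""
    def __init__(self, pretrained_linear: nn.Linear):
        super().__init__()
        self.W0 = pretrained_linear
        for p in self.W0.parameters():
            p.requires_grad = False  # freeze the original weights
        d_out, d_in = self.W0.weight.shape
        # full-size update matrix, initialized at zero so training starts from the pretrained model
        self.delta_W = nn.Parameter(torch.zeros(d_out, d_in))

    def forward(self, x):
        # W0 x + delta_W x == (W0 + delta_W) x
        return self.W0(x) + x @ self.delta_W.T
```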
On its own, this is not very interesting: a full-size ΔW would just double the memory allocated to those model parameters. The trick is to use a low-rank matrix approximation to reduce the number of parameters, operations, and memory required. We introduce 2 new matrices, A and B, to approximate ΔW:
ΔW ≈ BA

Here B has shape d×r and A has shape r×k, with the rank r much smaller than d and k, so we only store r(d + k) parameters instead of the d·k parameters of the full ΔW.
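Here is a hedged sketch of what that looks like as a PyTorch module. The LoRALinear class, the rank r, and the alpha scaling values below are illustrative choices, not a reference implementation; only A and B are trained, so for a 4096×4096 layer with r = 8 we train about 65K parameters instead of roughly 16.7M.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a low-rank update delta_W ≈ B @ A.
    Only A and B (r * (d_in + d_out) parameters) are trained."""
    def __init__(self, pretrained_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.W0 = pretrained_linear
        for p in self.W0.parameters():
            p.requires_grad = False  # freeze the original weights
        d_out, d_in = self.W0.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))         # d_out x r, zero init => BA starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # W0 x + (B A) x, computed as two small matmuls instead of materializing B @ A
        return self.W0(x) + (x @ self.A.T @ self.B.T) * self.scaling

# usage sketch: wrap an existing layer and count only the trainable LoRA parameters
base = nn.Linear(4096, 4096)
lora_layer = LoRALinear(base, r=8)
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65,536 vs ~16.7M for the full weight matrix
```

Initializing B to zero means the adapted model starts out exactly equal to the pretrained model, which is one reason this setup trains so smoothly.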
It is important to realize that, typically, the amount of training data used for fine-tuning is much smaller than the data used for pretraining. As a consequence, it is unlikely that we would even have enough data to reliably estimate every entry of the full matrix ΔW. The low-rank approximation acts as a regularization technique that helps the model generalize better to unseen data.
--
👉 Learn more about ML in my newsletter https://newsletter.TheAiEdge.io
--