Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It
Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization challenge: parameters associated with common tokens receive constant gradient updates, while parameters tied to rare tokens may go hundreds […] The post Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It appeared first on MarkTechPost.