Quick notes on the Chinchilla scaling law, its limits, and what lies beyond, for deep learning and LLMs.
Factors
- Model size (N)
- Dataset size (D)
- Training Cost (aka Compute) (C)
- Test Cross-entropy loss (L)
The intuitive way,
- Larger data will need a larger model, and will have a higher training cost. In other words, N, D & C all increase together, though not necessarily linearly; the relationship could be power-law, log-linear, etc.
- Likewise, loss is likely to decrease as the dataset grows. So an inverse relationship between L & D (& the rest).
- Tying them into equations would require some constants (scale factors, exponents like alpha and beta, etc.), unknown for now and identified empirically later; a sketch of one such fitted form follows this list.
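As a concrete example of such an equation, the Chinchilla paper (Hoffmann et al., 2022) fits a parametric loss of the form L(N, D) = E + A/N^alpha + B/D^beta. A minimal sketch, using roughly the constants reported there (illustrative, not authoritative):

```python
# A minimal sketch of the parametric loss form fitted in the Chinchilla paper:
#   L(N, D) = E + A / N**alpha + B / D**beta
# The constants below are approximately the published fit; treat them as
# illustrative. Training compute is often approximated as C ~= 6 * N * D FLOPs.

def scaling_loss(N: float, D: float,
                 E: float = 1.69, A: float = 406.4, B: float = 410.7,
                 alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted test cross-entropy loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def approx_compute_flops(N: float, D: float) -> float:
    """Common rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * N * D

# Example: roughly the Chinchilla setting (70B params, 1.4T tokens).
print(scaling_loss(70e9, 1.4e12), approx_compute_flops(70e9, 1.4e12))
```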
Beyond common sense, the theoretical foundations linking these factors aren't available right now. Perhaps the underlying problem is simply hard (NP-hard, even).
The next best thing, then, is to work out the relationships/bounds empirically: train and evaluate existing deep-learning models, LLMs, etc. on datasets spanning TBs/PBs of data and up to trillions of parameters, using compute budgets that cumulatively span years.
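As a toy illustration of this empirical route, the constants of the parametric form above can be fitted from a set of (N, D, loss) observations. The sketch below uses synthetic "measurements" generated from the same functional form; real studies fit against losses measured across many actual training runs.

```python
# A toy sketch of fitting scaling-law constants from (N, D, loss) observations.
# The observations here are synthetic, generated from the same functional form;
# real fits use losses measured from hundreds of actual training runs.
import numpy as np
from scipy.optimize import curve_fit

def loss_form(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Hypothetical grid of model sizes (parameters) and dataset sizes (tokens).
N_grid, D_grid = np.meshgrid([1e8, 4e8, 1e9, 4e9, 1e10],
                             [2e9, 1e10, 5e10, 2e11])
N_obs, D_obs = N_grid.ravel(), D_grid.ravel()
L_obs = loss_form((N_obs, D_obs), 1.69, 406.4, 410.7, 0.34, 0.28)  # pretend-measured losses

fit, _ = curve_fit(loss_form, (N_obs, D_obs), L_obs,
                   p0=[2.0, 300.0, 300.0, 0.3, 0.3], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], fit)))
```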
Papers by Hestness & Narang, Kaplan et al., and the Chinchilla authors (Hoffmann et al.) are all attempts along the empirical route. So are more recent efforts from MosaicML, DeepSeek, Llama 3, Microsoft, and various MoE papers, among many others.
Key takeaways:
- The scale & bounds are getting larger over time.
- Models from a couple of years back turn out to be grossly under-trained in terms of the volume of training data used. For compute-optimal training they should have been trained on roughly an order of magnitude more data, without risk of overfitting (see the sketch after this list).
- Conversely, the previously used data volumes are better suited to much smaller models (SLMs), with inference capabilities similar to those older LLMs.
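To put a number on the "under-trained" point: the Chinchilla result implies a compute-optimal ratio of roughly 20 training tokens per parameter. The sketch below applies that rule of thumb to GPT-3-scale figures (~175B parameters, ~300B training tokens, both approximate public numbers):

```python
# A rough illustration of the "under-trained" point using the Chinchilla
# rule of thumb of ~20 training tokens per parameter (compute-optimal).
# The GPT-3 figures below are approximate public numbers, used only as an example.

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio from the Chinchilla paper

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Rule-of-thumb compute-optimal training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

gpt3_params = 175e9   # ~175B parameters
gpt3_tokens = 300e9   # ~300B training tokens actually used
optimal = chinchilla_optimal_tokens(gpt3_params)  # ~3.5e12 tokens
print(f"Optimal tokens: {optimal:.2e}, actual: {gpt3_tokens:.2e}, "
      f"gap: {optimal / gpt3_tokens:.1f}x")  # roughly an order of magnitude under-trained
```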
References
- https://en.wikipedia.org/wiki/Neural_scaling_law
- https://lifearchitect.ai/chinchilla/
- https://medium.com/@raniahossam/chinchilla-scaling-laws-for-large-language-models-llms-40c434e4e1c1
- https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours
- https://medium.com/nlplanet/two-minutes-nlp-scaling-laws-for-neural-language-models-add6061aece7
- https://lifearchitect.ai/the-sky-is-bigger/