Language models require massive scale to train, but scale isn't only about the number of parameters or neurons; it also exists in the amount of data the model trains on.
Parameter count affects the post-training model size and the hardware needed to run it; data size does not. Stable Diffusion would require the same hardware to run whether it was trained on 1 billion images, 200 million images, or a single image.
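As a rough back-of-the-envelope sketch (not from the original post; it assumes fp16 weights and a parameter count on the order of 1B, which is roughly where Stable Diffusion's denoiser sits), inference memory depends on parameter count alone, with training data size nowhere in the formula:

    # Weight memory for inference depends only on parameter count,
    # not on how many images the model was trained on.
    def inference_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
        """Approximate weight footprint, assuming fp16 (2 bytes/param)."""
        return n_params * bytes_per_param / 1e9

    # Same ~2 GB of weights whether training saw 1 image or 1 billion.
    print(inference_memory_gb(1e9))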
Most LLM training has focused on parameter count as the axis of scale.
Meta trained a series of models (the LLaMA family) on far more data than the original GPT-3 was trained on. That extra data improved performance even on the much smaller models in the series.
The parent poster was talking about training longer while keeping the model small, so it would not be expensive to use in production. It's a trade-off: you could train for less time with a larger model.
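To sketch that trade-off, here is an illustration using the common approximation that training compute is roughly 6 × parameters × tokens; the 13B/4T pairing below is purely illustrative, not a figure from the post, while the GPT-3 numbers (175B parameters, ~300B tokens) are its published ones:

    # Approximate training compute: ~6 * N (params) * D (tokens).
    def training_flops(n_params: float, n_tokens: float) -> float:
        return 6 * n_params * n_tokens

    big_short  = training_flops(175e9, 300e9)    # GPT-3-sized model, ~300B tokens
    small_long = training_flops(13e9, 4000e9)    # much smaller model, far more tokens (illustrative)

    # Similar training compute for both configurations...
    print(f"{big_short:.2e} vs {small_long:.2e}")
    # ...but per-request inference cost scales with N only, so the smaller,
    # longer-trained model is far cheaper to serve.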