Training cost has increased so much precisely because inference cost is the bigger problem: models are now trained on close to three orders of magnitude more data than what is compute-optimal (per the Chinchilla paper), because the savings on inference make it worthwhile to overtrain a smaller model, spending extra training compute to reach similar performance.
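To make that trade-off concrete, here's a rough back-of-envelope sketch in Python using the usual approximations of ~6*N*D FLOPs for training and ~2*N FLOPs per generated token for inference. All the model sizes, token counts, and the lifetime serving volume below are made-up illustrative assumptions, not figures from any actual deployment.

    # Back-of-envelope sketch of the train-vs-inference trade-off.
    # Assumptions: ~6 FLOPs per parameter per training token,
    # ~2 FLOPs per parameter per generated token at inference.

    def training_flops(n_params: float, n_train_tokens: float) -> float:
        """Approximate total training compute."""
        return 6 * n_params * n_train_tokens

    def inference_flops(n_params: float, n_served_tokens: float) -> float:
        """Approximate total inference compute over the model's lifetime."""
        return 2 * n_params * n_served_tokens

    # Hypothetical comparison: a Chinchilla-style model (~20 tokens/param)
    # vs. a smaller model overtrained on far more data to reach similar quality.
    chinchilla = {"params": 70e9, "train_tokens": 1.4e12}   # ~20 tokens/param
    overtrained = {"params": 8e9, "train_tokens": 15e12}    # ~1900 tokens/param

    served_tokens_lifetime = 1e13  # assumed lifetime inference volume

    for name, m in [("chinchilla-optimal", chinchilla),
                    ("overtrained-small", overtrained)]:
        train = training_flops(m["params"], m["train_tokens"])
        serve = inference_flops(m["params"], served_tokens_lifetime)
        print(f"{name:>18}: train {train:.2e} FLOPs, "
              f"inference {serve:.2e} FLOPs, total {train + serve:.2e}")

With these made-up numbers the smaller overtrained model costs more to train per token of data but far less to serve, so its total lifetime compute comes out lower once the serving volume is large enough, which is the whole argument for overtraining.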
Interesting. I understand that, but I don't know to what degree.
I mean the training, while expensive, is done once. The inference, besides being done by perhaps millions of clients, goes on for, well, the life of the model. Surely that adds up.
It's hard to know, but I assume the user taking up the burden of the inference is perhaps doing so more efficiently? I mean, when I run a local model, it plods along, nowhere near as quick as the online model. So it's slow, and therefore, I assume, necessarily more power-efficient.