Let me clarify. I was too vague and definitely did not express things accurately. That is on me.

We have the math to show that it can be impossible to distinguish two explanations through data processing alone. We have examples of this in science, a long history of them in fact. Fundamentally, there is a great deal we cannot conclude from processing data alone. Science (the search for knowledge) is active. It doesn't just require processing existing data; it requires seeking out new data. We propose competing hypotheses that are indistinguishable given the current data and then seek out the data that distinguishes them (a pain point for many of the TOEs like String Theory). We know that data processing alone is insufficient for explanation. We know it cannot distinguish a confounded correlation from a causal one. We know it cannot distinguish causal graphs (e.g., triangular maps: we can construct them, but we cannot distinguish them through data processing alone). The problem with scaling alone is that it asserts data processing is enough, yet so much of our work (and history) tells us that data processing is insufficient.
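Here's a minimal sketch of the causal-graph point, assuming linear-Gaussian models (my illustration of the general result, not a specific example from the literature above): two structural models with opposite causal directions that induce exactly the same observational distribution. No amount of processing samples from either one can separate them; only new, interventional data can.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Model A: X causes Y.  X ~ N(0, 1), Y = b*X + noise, scaled so Var(Y) = 1.
    b = 0.6
    x_a = rng.normal(0.0, 1.0, n)
    y_a = b * x_a + rng.normal(0.0, np.sqrt(1 - b**2), n)

    # Model B: Y causes X.  Y ~ N(0, 1), X = b*Y + noise, scaled so Var(X) = 1.
    y_b = rng.normal(0.0, 1.0, n)
    x_b = b * y_b + rng.normal(0.0, np.sqrt(1 - b**2), n)

    # Both models imply the same bivariate Gaussian (unit variances, correlation b),
    # so no statistic computed from observational samples can tell them apart.
    print(np.cov(x_a, y_a))
    print(np.cov(x_b, y_b))

Distinguishing the two requires going out and collecting new (interventional) data, which is exactly the "active" part of science described above.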

The scaling math itself also shows sharply diminishing returns with scale and often does not suggest convergence even with infinite data. The curves are power laws with positive concavity, requiring exponential increases in data and parameters for marginal improvements in test loss. I'm not claiming that we need zero test loss to reach AGI, but the results do tell us that if test loss is strongly correlated with the capabilities we care about, then we'll need to spend exponentially more to achieve AGI even if we are close. By our own measures, scaling is not enough unless we are already sufficiently close. Our empirical results align with this: despite many claiming that scale is all we need, we keep making significant changes to model architectures and training procedures (including optimizers). We make these large changes because throwing new data at the old models (even while simply increasing the number of parameters) does not work out. It is not just the practicality; it is the results. The scaling claim has always been a myth used to drive investment, since it is a nice simple story that says we can get there by doing what we've already been doing, just more of it. We all know that these new LLMs aren't dramatic improvements off their previous versions, despite being much larger, more efficient, and having processed far more data.
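To put a rough number on that, here's a back-of-the-envelope sketch; the exponent is an assumed, illustrative value in the general range reported by scaling-law papers, not a measured one.

    # Pure power-law loss curve: L(C) = a * C**(-alpha).
    # alpha is an assumed, illustrative exponent; the exact value is not
    # something established in this thread.
    alpha = 0.05

    # Halving the (reducible) loss requires multiplying compute by 2**(1/alpha).
    factor = 2 ** (1 / alpha)
    print(f"compute multiplier to halve the loss: {factor:.2e}")  # ~1.05e+06

Under that assumption, each halving of the loss costs roughly a million-fold increase in compute.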

[side note]: We even have my namesake, who would argue that there are truths which are not provable within a system that is both consistent and effectively axiomatized (effectively calculable). But we need not go that far, as omniscience is not a requirement for AGI. Though it is worth noting when considering the limits of our models, since at the core this matters: changing our axioms changes the results, even with the same data. And science doesn't exclusively use a formal system, nor does it use a single one.



My apologies for the much delayed reply as I have recently found myself with little extra time to post adequate responses. Your critiques are very interesting to ponder, so I thank you for posting them. I did want to respond to this one though.

I believe all of my counterarguments center on my current viewpoint that, given the rapid rate of progress on the engineering side, it is no longer reasonable in deep learning theory to consider what is possible; it is more interesting to try to outline hard limitations. This puts deep learning in stark contrast with classical statistics, where the boundaries are very clear in a way they are not for deep learning.

I want to stress that, at present, nearly every limitation of deep learning conjectured over the last several decades has fallen. This includes many back-of-the-napkin, "clearly obvious" arguments, so I'm wary of them now. I think the skepticism has all along been fueled by reaction to hype cycles, and we must be careful not to make the same mistakes. There is far too much empirical evidence countering any precise argument against the claim that there is an underlying understanding within these models, so it seems we must resort to imprecise arguments to continue the debate.

Scaling along one axis suggests that a high-degree polynomial amount of additional compute (not exponential) is required for continued improvements; this is true. But the progress of the last few years has come from discovering new axes to scale on, each of which further reduces the error rate and improves performance, and there are still many potential axes left untapped. What is significant about scaling to me is not how much additional compute is required, but the fact that the predicted floor at the moment is very, very low, far lower than anything we have ever seen, and reaching it doesn't require any more data than we currently have. That should be cause for concern until we find a better lower bound.
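As a sketch of what I mean by the "predicted floor": fitted scaling laws typically take a parametric form with an irreducible constant term, and the reducible terms shrink as each axis is scaled. The functional form below follows the Chinchilla-style fit; every constant is a placeholder chosen for illustration, not a fitted value.

    # Chinchilla-style parametric fit: L(N, D) = E + A / N**alpha + B / D**beta.
    # The functional form is from the scaling-law literature; the constants below
    # are placeholders for illustration, not fitted values.
    def loss(n_params: float, n_tokens: float,
             E: float = 1.7, A: float = 400.0, B: float = 400.0,
             alpha: float = 0.34, beta: float = 0.28) -> float:
        return E + A / n_params**alpha + B / n_tokens**beta

    # As parameters and data grow (along any mix of axes), the reducible terms
    # shrink and the loss approaches the floor E -- the "predicted bottom".
    for scale in (1e9, 1e11, 1e13):
        print(f"N = D = {scale:.0e}: loss ~ {loss(scale, scale):.3f}")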

> We all know that these new LLMs aren't dramatic improvements off their previous versions

No, I don't agree. It may seem evident to many, but to some the differences are stark. Our perception of performance is nonlinear and person-dependent, and these major differences can be imperceptible to most. The vast majority of attempts to provide more consistent metrics or benchmarks that are not already saturated have shown that LLM development is not slowing down by any stretch. I'm not saying that LLMs will "go to the moon", but I don't have anything concrete to say they cannot either.

> We have the math to show that it can be impossible to distinguish two explanations through data processing alone.

Actually, this is a really great point, but I think it highlights the limitations of benchmarks and the need for capacity-based, compression-based, or other data-independent metrics. With those in hand, it can become possible to distinguish two explanations. This could be a fruitful line of inquiry.
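One hypothetical version of a compression-based comparison, as a minimal sketch (my framing, not an established benchmark): a two-part MDL score charges each explanation for its own description length on top of its fit, so two explanations that fit the data identically can still be separated.

    import math

    def two_part_mdl(model_bits: float, residuals: list[float], sigma: float = 1.0) -> float:
        """Two-part MDL score: bits to describe the model itself plus bits to
        describe the data given the model (Gaussian negative log-likelihood)."""
        data_bits = sum(
            0.5 * math.log2(2 * math.pi * sigma**2) + r**2 / (2 * sigma**2 * math.log(2))
            for r in residuals
        )
        return model_bits + data_bits

    # Two hypothetical explanations that fit the data identically (same residuals)
    # but differ in how cheaply they themselves can be described: the data term
    # cannot separate them, the model term can.
    residuals = [0.1, -0.2, 0.05, 0.15]
    print(two_part_mdl(model_bits=50, residuals=residuals))
    print(two_part_mdl(model_bits=5_000, residuals=residuals))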



