There's a pile of work on multimodal inputs to LLMs, generally finding that less text training data is needed once image (or other modality) data is added to training.
Text is an extremely limited input stream, but an input stream nonetheless. We know that animal intelligence works well enough with any of a range of sensory streams, and with different levels of emphasis on those streams - humans are somehow functional despite lacking ultrasonic perception and having only a primitive sense of smell.
And your definition of a concept is quite self-serving... I say that as a mathematician familiar with many concepts which don't map at all to sensory-motor experiences.
Sensory-motor expression of concepts is primitive, yes; they become abstracted, and yes, the semantics of those abstractions can be abstract. I'm not talking semantics, I'm talking genesis.
How does one generate representations whose semantics are the structure of the world? Not via text token frequency, this much is obvious.
I don't think the thinnest sense of "2 + 2 = 4" being true is what a mathematician understands -- they understand, rather, the proposition: the object 2, the map `+`, and so on. And when they imagine a sphere of radius 4 containing a square of side 2, etc., I think there's a 'sensuous, mechanical depth' that enables and permeates their thinking.
The intellect is formal only in the sense that, absent content, it has form. That content however is grown by animals at play in their environment.