Image/video generation could plausibly advance LLMs in a substantial way:
If the LLM, during its "thinking" phase, encountered a scenario where it had to imagine a particular scene (say, a pink elephant in a hotel lobby), it could internally generate that image and use it to aid in world-simulation and understanding.
This is what happens in my head at least!
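The loop described above can be sketched in code. This is purely illustrative: `think_step`, `generate_image_latent`, and the `<visualize>` marker are all hypothetical stand-ins, not a real model API. The idea is just that the reasoning loop intercepts a "visualize" request, calls an image model, and folds the imagined image back into the context.

```python
# Hedged sketch of a "visual chain-of-thought" loop.
# All names here (think_step, generate_image_latent, VISUALIZE) are
# hypothetical stand-ins for model components, not a real API.

VISUALIZE = "<visualize>"

def think_step(context):
    # Stub LLM: when the scene hasn't been imagined yet, emit a
    # visualize request instead of plain text reasoning.
    if "pink elephant" in context and VISUALIZE not in context:
        return VISUALIZE + "a pink elephant in a hotel lobby"
    return "conclusion: the elephant would dwarf the lobby furniture"

def generate_image_latent(prompt):
    # Stub image model: returns an opaque latent the LLM could
    # attend over in a real multimodal architecture.
    return {"prompt": prompt, "latent": [0.0] * 8}

def reason(prompt, max_steps=4):
    context = prompt
    for _ in range(max_steps):
        step = think_step(context)
        if step.startswith(VISUALIZE):
            scene = step[len(VISUALIZE):]
            latent = generate_image_latent(scene)
            # Fold the imagined image back into the context, standing
            # in for cross-attention over generated image tokens.
            context += f" {VISUALIZE}[imagined: {latent['prompt']}]"
        else:
            return context + " " + step
    return context

print(reason("Could a pink elephant fit in a hotel lobby?"))
```

In a real system the latent would be attended to directly rather than pasted into the text context, but the control flow, thinking interrupted by an internal act of imagination, is the same.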