Hmm you note that the problem is the LLM doesn’t have enough image context, but ...

Hmm you note that the problem is the LLM doesn’t have enough image context, but then zoom the image more?

Why not downscale the image and feed it as a second input so that entire planets fit into a patch and instruct it to use the doensampled image for coarse coordinate estimation