I think that the major problem with CV is that it only recognizes images in isolation from each other. Humans understand what they are looking at by finding the concept that lies at the intersection of all the small ideas in the image. For example, a human would recognize a keyboard because it contains a "means of input" on which there are "symbols", specifically the "alphabet", arranged in a "logical format" ("QWERTYUIOP"), which they know is the sign of a keyboard. If a human were to see a keyboard that looks different from most, they could still infer that it is a keyboard by understanding the underlying concepts of what they see.

On the other hand, a computer mechanically relates the specific format of a keyboard to the word "keyboard." It fuzzy-matches the pixels of an image to extract the object in it, not the individual ideas implicit in the image.

Computer Vision needs more depth to actually be considered vision.