Tokenization. Before text is fed into an LLM, it's processed into tokens typically consisting of multiple characters. GPT-4o would see the word "jokes" as the tokens [73, 17349]. That's much more efficient than processing individual characters, but it means that LLMs can't count letters or words without some additional trickery. LLMs have also struggled with arithmetic for the same reason - they can't "see" the numbers in the text, just tokens representing groups of characters.
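A toy sketch of the effect (hypothetical greedy tokenizer and made-up vocabulary, not the real GPT-4o/o200k encoder): once "jokes" becomes two integer IDs, the letter-level structure is simply gone from the model's input.

```python
# Toy BPE-style vocabulary. The IDs mirror the [73, 17349] example above
# but are illustrative -- this is not the real GPT-4o vocabulary.
VOCAB = {"j": 73, "okes": 17349, "ok": 500, "es": 501}

def tokenize(text, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

ids = tokenize("jokes", VOCAB)
print(ids)       # [73, 17349] -- the model receives two IDs, not five letters
print(len(ids))  # 2, so "how many letters?" isn't answerable from the IDs alone
```

The same collapse explains the arithmetic trouble: a number like "12345" may arrive as one or two opaque IDs, so digit-by-digit carrying has to be learned indirectly rather than read off the input.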

