
Tokenization is purely an implementation detail. If OpenAI had cared, they could have deleted those obviously glitched tokens from their tokenizer. They simply didn't inspect the vocabulary carefully, or didn't care.

GPT-4 does not suffer from the same glitched tokens as GPT-3, presumably because it uses a different tokenizer.
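
For illustration, here is a rough sketch using the tiktoken package. The encoding names are my assumption about which vocabularies the models use, and " SolidGoldMagikarp" is one of the widely reported glitch tokens:

    import tiktoken

    gpt3_enc = tiktoken.get_encoding("r50k_base")    # GPT-3-era BPE vocabulary
    gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 vocabulary

    s = " SolidGoldMagikarp"   # reportedly a glitch token in the older vocabulary
    print(gpt3_enc.encode(s))  # should come back as a single token ID
    print(gpt4_enc.encode(s))  # should split into ordinary subword pieces

If the string no longer maps to one dedicated (and undertrained) vocabulary entry, the glitch behavior has nothing to latch onto.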

Furthermore, there are LLMs that operate on raw bytes instead of multi-character tokens, which sidesteps the problem entirely.
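
A minimal sketch of the idea behind byte-level models (ByT5 is one example): the input is just the UTF-8 byte sequence, so there is no learned vocabulary entry that can end up undertrained and "glitched":

    text = " SolidGoldMagikarp"
    byte_ids = list(text.encode("utf-8"))  # one integer in 0-255 per byte
    print(byte_ids)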


