Tokenization is purely an implementation detail. If OpenAI had cared, they could have removed those obviously glitched tokens from their tokenizer; they simply didn't inspect it carefully enough, or didn't care.
GPT-4 does not suffer from the same glitched tokens as GPT-3, presumably because it uses a different tokenizer (the cl100k_base vocabulary rather than GPT-3's older r50k/p50k BPE).
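You can see the difference yourself with the tiktoken library. Here's a minimal sketch comparing how the GPT-3-era and GPT-4 tokenizers handle " SolidGoldMagikarp", one of the widely reported glitch tokens; the exact ids printed will depend on your tiktoken version.

```python
import tiktoken

text = " SolidGoldMagikarp"  # note the leading space

gpt3_enc = tiktoken.get_encoding("r50k_base")    # used by the original GPT-3 models
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # used by GPT-4 / GPT-3.5-turbo

print("r50k_base:  ", gpt3_enc.encode(text))   # a single token id -> glitch-prone
print("cl100k_base:", gpt4_enc.encode(text))   # split into ordinary sub-tokens
```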
Furthermore, there are LLMs that operate on single bytes instead of multi-character tokens, which sidesteps the problem entirely.
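For a byte-level model (ByT5 is one example), the "tokens" are just the raw UTF-8 bytes of the input, so there is no learned vocabulary that could contain a glitched entry. A rough sketch of what the model sees (ByT5 additionally offsets byte values to reserve ids for special tokens, which is omitted here):

```python
text = " SolidGoldMagikarp"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # one id per byte, each in the range 0-255
print(len(byte_ids))  # 18 bytes -> 18 "tokens", no learned vocabulary involved
```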