

_gabe_ on July 20, 2023 | on: Simply Parse in C

> Cute, now do it with UTF-8 support.

I see this type of sentiment a lot and I'm not sure why it exists. Maybe it's because there were a bunch of formats in the past and that made things more difficult? Idk.

Anyways, I finally decided to "bite the bullet" and set aside a solid week to do the "nitty gritty" of writing a UTF-8 validator/logging library. Turns out it was super easy: it took me about an hour to read through the RFC and maybe 2 more hours to write a simple implementation.

For anyone who's curious, give it a read here[0]. It's surprisingly readable, and the format is very simple and elegant. I don't say simple as in dumb either. I say simple as in they made the problem as simple as it needs to be with no unneeded complexity, and it's a breath of fresh air.

Also, it's written in such a way that any valid ASCII is valid UTF-8. So at the very least, you can just check whether the string contains any bytes with the highest bit set before parsing. If it does, you can throw an error saying you don't support UTF-8 and avoid parsing potentially invalid data (not that it's particularly difficult to validate the UTF-8 if you want to).

[0]: https://datatracker.ietf.org/doc/html/rfc3629#section-3
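To make the two checks described above concrete, here is a minimal sketch in C. It is not the code from the linked library. The function names `is_ascii` and `is_valid_utf8` are hypothetical, and the byte ranges follow the ABNF in RFC 3629, section 4, which rejects overlong encodings, surrogates, and values above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper names; a sketch, not the linked library's API. */

/* True if no byte has its highest bit set, i.e. the buffer is plain ASCII. */
bool is_ascii(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    for (size_t i = 0; i < len; i++) {
        if (p[i] & 0x80) return false;
    }
    return true;
}

static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* True if the buffer is well-formed UTF-8 per the RFC 3629 ABNF. */
bool is_valid_utf8(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;
    while (i < len) {
        unsigned char b0 = p[i];
        size_t tails;            /* continuation bytes expected after b0 */
        unsigned char lo = 0x80; /* allowed range of the first continuation byte */
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) { i++; continue; }          /* single ASCII byte */
        else if (in_range(b0, 0xC2, 0xDF)) { tails = 1; }
        else if (in_range(b0, 0xE0, 0xEF)) {
            tails = 2;
            if (b0 == 0xE0) lo = 0xA0;              /* no overlong 3-byte forms */
            if (b0 == 0xED) hi = 0x9F;              /* no surrogates U+D800..U+DFFF */
        } else if (in_range(b0, 0xF0, 0xF4)) {
            tails = 3;
            if (b0 == 0xF0) lo = 0x90;              /* no overlong 4-byte forms */
            if (b0 == 0xF4) hi = 0x8F;              /* nothing above U+10FFFF */
        } else {
            return false;                           /* 0x80..0xC1 or 0xF5..0xFF */
        }

        if (len - i - 1 < tails) return false;      /* sequence is truncated */
        if (!in_range(p[i + 1], lo, hi)) return false;
        for (size_t k = 2; k <= tails; k++) {
            if (!in_range(p[i + k], 0x80, 0xBF)) return false;
        }
        i += tails + 1;
    }
    return true;
}
```

A parser that only handles ASCII could call `is_ascii` up front and reject anything else, while one that accepts UTF-8 could call `is_valid_utf8` instead. Both take an explicit length so that embedded NUL bytes are not treated as terminators.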

pjmlp on July 20, 2023

Have you validated that simple implementation with anything besides English?

_gabe_ on July 20, 2023

Yes.

Let me know if you see any bugs[0], I'll add it to my regression tests.

[0]: https://github.com/ambrosiogabe/CppUtils/blob/master/single_...

pjmlp on July 20, 2023

Fair enough.

I was thinking about the usual cases: text where the lowercase and uppercase forms have different character counts, languages written in non-Latin scripts, and so on.
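As an illustration of why character counts are a separate question from validation, here is a small sketch in C that counts code points in a buffer assumed to already be valid UTF-8. The function name `utf8_count_code_points` is hypothetical. It relies on the fact that continuation bytes always have the bit pattern `10xxxxxx`, so every other byte starts a new code point.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a buffer that is assumed to be
 * valid UTF-8. Bytes of the form 10xxxxxx are continuation bytes, so every
 * byte that does not match that pattern begins a new code point. */
size_t utf8_count_code_points(const char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For example, the German "ß" is the two bytes `C3 9F` and one code point, while its uppercase form under Unicode full case mapping is "SS", which is also two bytes but two code points. Case conversion like this needs Unicode mapping tables and is outside what a validator does.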
