> Number of native (preferably UTF-8) code units > The first is useful for basic...

josephg · on March 26, 2021

> Can you expand on this? I don't see why knowing the number of code units would be useful except when calculating the total size of the string to allocate memory

I’ve used this for collaborative editing. If you want to send a change saying “insert A at position 10”, the question is: what units should you use for “position 10”?

- If you use byte offsets, you have to enforce one encoding on all machines, even when that doesn’t make sense. You also allow edits at invalid locations (for example, in the middle of a multi-byte sequence) to corrupt the encoding, which goes against the principle of making invalid states impossible to represent.

- If you use grapheme clusters, the positions aren’t portable between systems or library versions. What today is position 10 in a string might tomorrow be position 9 due to new additions to the Unicode spec.

The cleanest answer I’ve found is to count in Unicode code points. This approach is encoding-agnostic, portable, simple, well-defined, and stable across time and between platforms.
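As a rough illustration of the idea (a sketch, not code from any particular editor): Python's `str` is indexed by code point, so a position like "7" can be applied directly and then translated into whatever units the local storage happens to use. The helper names `insert_at`, `codepoint_to_utf8_offset`, and `codepoint_to_utf16_offset` are made up for this example.

```python
def insert_at(text: str, pos: int, new: str) -> str:
    """Apply 'insert `new` at code point position `pos`'."""
    if not 0 <= pos <= len(text):
        raise IndexError(f"position {pos} out of range")
    # Python str slicing works on code points, so this can never
    # split a multi-byte sequence or a surrogate pair.
    return text[:pos] + new + text[pos:]


def codepoint_to_utf8_offset(text: str, pos: int) -> int:
    """Byte offset of code point position `pos` if `text` is stored as UTF-8."""
    return len(text[:pos].encode("utf-8"))


def codepoint_to_utf16_offset(text: str, pos: int) -> int:
    """Code unit offset of position `pos` if `text` is stored as UTF-16."""
    return len(text[:pos].encode("utf-16-le")) // 2


doc = "naïve 😀 text"
pos = 7  # just after the emoji, counted in code points

print(insert_at(doc, pos, "A"))            # naïve 😀A text
print(codepoint_to_utf8_offset(doc, pos))  # 11 (ï is 2 bytes, 😀 is 4)
print(codepoint_to_utf16_offset(doc, pos)) # 8  (😀 is a surrogate pair)
```

The same logical position maps to 7, 11, or 8 depending on the unit, which is why the unit used on the wire has to be pinned down. The conversions here are O(n) in the prefix length; a real implementation would probably want an index to avoid rescanning.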

7786655 · on March 26, 2021

> calculating the total size of the string to allocate memory

In some languages this is 90% of everything you do with strings. In other languages it's still 90% of everything done to strings, but done automatically.

tedunangst · on March 26, 2021

Neither code points nor code units help with uppercasing ß.
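A quick way to see this, using Python's built-in `str.upper()`, which follows Unicode's default full case mapping (the `lengths` helper is just for this example): the uppercased string can be longer than the input, so no length measured beforehand predicts the result.

```python
def lengths(s: str) -> dict:
    """Length of `s` in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }


lower = "straße"
upper = lower.upper()

print(upper)           # STRASSE
print(lengths(lower))  # 6 code points, 7 UTF-8 bytes, 6 UTF-16 units
print(lengths(upper))  # 7 code points, 7 UTF-8 bytes, 7 UTF-16 units
```

Here the code point and UTF-16 counts grow by one while the UTF-8 byte count happens to stay the same, so a buffer sized from the input is only safe by coincidence.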