> Can you expand on this? I don't see why knowing the number of code units would be useful except when calculating the total size of the string to allocate memory
I’ve used this for collaborative editing. If you want to send a change saying “insert A at position 10”, the question is: what units should you use for “position 10”?
- If you use byte offsets, then you have to enforce an encoding on all machines, even when that doesn’t make sense. You’re also allowing the encoding to become corrupted by edits at invalid locations, such as the middle of a multi-byte sequence, which goes against the principle of making invalid state impossible to represent.
- If you use grapheme clusters, the positions aren’t portable between systems or library versions. What today is position 10 in a string might tomorrow be position 9 due to new additions to the Unicode spec.
The cleanest answer I’ve found is to count in Unicode codepoints. This approach is encoding-agnostic, portable, simple, well-defined, and stable across time and between platforms.
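As a rough illustration of the idea (not taken from any particular editor), here is a minimal Python sketch in which an edit is expressed as a codepoint position and each peer translates it to whatever offset its own storage uses. The function names are hypothetical, and it assumes the text contains no lone surrogates.

```python
def codepoint_to_utf8_offset(text: str, pos: int) -> int:
    """Byte offset, in the UTF-8 encoding of text, of the codepoint at index pos."""
    return len(text[:pos].encode("utf-8"))


def codepoint_to_utf16_offset(text: str, pos: int) -> int:
    """Offset in 16-bit code units, in UTF-16, of the codepoint at index pos."""
    return len(text[:pos].encode("utf-16-le")) // 2


def insert_at_codepoint(text: str, pos: int, new: str) -> str:
    """Apply the edit 'insert new at codepoint position pos'."""
    if not 0 <= pos <= len(text):
        raise IndexError("position outside the document")
    # Python's str is indexed by codepoint, so slicing can never split
    # an encoded character in half.
    return text[:pos] + new + text[pos:]


# "naïve", a space, U+1F600 (an emoji), a space, "text"
doc = "na\u00efve \U0001F600 text"
pos = 7  # the codepoint position just after the emoji

utf8_offset = codepoint_to_utf8_offset(doc, pos)    # what a UTF-8 peer would use
utf16_offset = codepoint_to_utf16_offset(doc, pos)  # what a UTF-16 peer would use
edited = insert_at_codepoint(doc, pos, "A")
```

The same position maps to different local offsets depending on the encoding, but the position sent over the wire is identical for every peer, and it always lands on a codepoint boundary.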
> calculating the total size of the string to allocate memory
In some languages this is 90% of everything you do with strings. In other languages it's still 90% of everything done to strings, but done automatically.
> The first is useful for basic string operations
Can you expand on this? I don't see why knowing the number of code units would be useful except when calculating the total size of the string to allocate memory. Basic string operations, such as converting to uppercase, would operate on codepoints, regardless of how many code units are used to encode that codepoint.
Converting 'á' to 'Á', for example, is an operation on one codepoint but, in an encoding such as UTF-8, on multiple code units.
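To make the codepoint versus code unit distinction concrete, here is a small Python sketch. The `code_units` helper is a hypothetical name introduced for illustration; it simply encodes the string and divides by the width of one code unit.

```python
def code_units(text: str, encoding: str) -> int:
    """Number of code units text occupies in the given encoding."""
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size


lower = "\u00e1"        # á as a single precomposed codepoint
upper = lower.upper()   # case mapping is defined on codepoints

codepoints = len(upper)                   # Python's len() counts codepoints
utf8_units = code_units(upper, "utf-8")
utf16_units = code_units(upper, "utf-16-le")

# Case mapping can also change the number of codepoints, so a buffer
# size computed before the conversion may not be valid afterwards.
sharp_s_upper = "\u00df".upper()          # ß
```

Here the uppercase result is one codepoint that takes two code units in UTF-8 and one in UTF-16, while uppercasing 'ß' yields the two-codepoint string "SS". That is one place where the code unit count matters again after a codepoint-level operation.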