True for the GP's suggestion of a varint encoding, but also true for UTF-8 (which is a varint encoding). So that's not much of a loss; we're already biting this bullet.

Still, though, you could have a fixed-size encoding that's more compact than UTF-8, if you limited what it could encode. You'd then hold either it or UTF-8 text in a tagged union: an ADT wrapped with an API of string operations that implicitly "promote" your limited encoding to UTF-8 if the other arg is UTF-8, the same way integers get "promoted" to floats when you math them together.

Then your limited-encoding text could hold and manipulate e.g. ASCII, or Japanese hiragana and katakana, or APL, or whatever else your system mostly holds, as a random-access array of single-octet codepoints. That lasts until something outside of that repertoire comes up, at which point you get UTF-8 text instead and your random-access operations become shimmed by seq-scans.
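
To make that concrete, here's a rough Python sketch of the kind of ADT I mean. Everything in it is made up for illustration: the `Text` class, the `CODE_PAGE` table (plain ASCII here, but a kana or APL table would slot in the same way), and the method names. It only shows concatenation and indexing, and it's a sketch of the idea rather than how any real string library does it.

```python
from dataclasses import dataclass

# Hypothetical one-octet "code page": position in this string == octet value.
# It is just ASCII here; any table of up to 256 codepoints would work.
CODE_PAGE = "".join(chr(i) for i in range(128))
_TO_OCTET = {ch: i for i, ch in enumerate(CODE_PAGE)}


@dataclass(frozen=True)
class Text:
    """Tagged union: tag is 'page' (one octet per codepoint) or 'utf8'."""
    tag: str
    data: bytes

    @staticmethod
    def of(s: str) -> "Text":
        # Use the compact form only if every codepoint is in the code page.
        if all(ch in _TO_OCTET for ch in s):
            return Text("page", bytes(_TO_OCTET[ch] for ch in s))
        return Text("utf8", s.encode("utf-8"))

    def to_utf8(self) -> bytes:
        if self.tag == "utf8":
            return self.data
        return "".join(CODE_PAGE[b] for b in self.data).encode("utf-8")

    def __add__(self, other: "Text") -> "Text":
        # Same tag: plain byte concatenation.
        if self.tag == other.tag:
            return Text(self.tag, self.data + other.data)
        # Mixed tags: promote to UTF-8, like int + float giving float.
        return Text("utf8", self.to_utf8() + other.to_utf8())

    def char_at(self, i: int) -> str:
        if i < 0:
            raise IndexError(i)
        if self.tag == "page":
            return CODE_PAGE[self.data[i]]  # O(1) random access
        # UTF-8: sequential scan, counting lead (non-continuation) bytes.
        count, start = -1, None
        for pos, b in enumerate(self.data):
            if (b & 0xC0) != 0x80:
                count += 1
                if count == i:
                    start = pos
                elif count == i + 1:
                    return self.data[start:pos].decode("utf-8")
        if start is None:
            raise IndexError(i)
        return self.data[start:].decode("utf-8")
```

So `Text.of("hello")` stays in the one-octet form with O(1) indexing, while `Text.of("hello") + Text.of("wörld")` quietly comes back as UTF-8 and `char_at` falls back to scanning.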

(Or you get a rope with both UTF-8 strings and limited-encoding strings as leaf nodes!)

Of course, if you didn't catch it, I'm talking about going back to having code pages. :) Just from a perspective where everything is "canonically" UTF-8 and code pages are an internal optimization within your string ADT, rather than everything "canonically" being char[] of the system code page.