Just a small detail that isn't mentioned in the article:
in NFC form, "base characters and modifiers are combined into a single rune whenever possible"
the interesting detail is "whenever possible": since NFC works by first decomposing and then recomposing, there are some cases where the characters remain decomposed even after you run NFC normalization
an example is 𝅘𝅥𝅮 (U+1D160), whose NFC form is made of 3 different code points
I tried to look at the algorithm for generating the composition table, and it seems it's generated from the decomposition table. If that's so, I can't understand how some code points end up with an NFC form longer than 1
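For anyone who wants to reproduce this, here is a minimal check using Python's standard `unicodedata` module. The variable names are just for illustration, and the exact result depends on the Unicode data shipped with your Python build.

```python
import unicodedata

# U+1D160 MUSICAL SYMBOL EIGHTH NOTE, written as a single code point.
eighth_note = "\U0001D160"

# NFC decomposes first and then recomposes where the rules allow it.
nfc = unicodedata.normalize("NFC", eighth_note)

# List the code points that make up the NFC result.
nfc_codepoints = [f"U+{ord(ch):04X}" for ch in nfc]
print(nfc_codepoints)  # ['U+1D158', 'U+1D165', 'U+1D16E']
```

So a single input code point comes out of NFC as three.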
1. It's decompose, reorder, compose. So you can see some weird stuff like ḍ̇ = ḋ◌̣ → NFD = d◌̣◌̇ → NFC = ḍ◌̇
2. It's not compression, it's normalisation. So it's not "compose everything you can". I can't tell you the exact algorithm off the top of my head, but:
the reason for U+1D160: it's in the CompositionExclusions list.
> When a character with a canonical decomposition is added to Unicode, it must be added to the composition exclusion table if there is at least one character in its decomposition that existed in a previous version of Unicode. If there are no such characters, then it is possible for it to be added or omitted from the composition exclusion table. The choice of whether to do so or not rests upon whether it is generally used in the precomposed form or not.
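Both points can be poked at from Python; this is only a rough sketch, not the normalization algorithm itself. As far as I know `unicodedata` doesn't expose the composition exclusion property directly, so the hypothetical helper `stays_decomposed` approximates it: it flags any character that has a canonical decomposition but doesn't survive NFC. That also catches singleton decompositions such as U+212B, not only entries listed in CompositionExclusions.txt.

```python
import sys
import unicodedata


def codepoints(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# Point 1: decompose, reorder, compose.
# Input: d-with-dot-above (U+1E0B) followed by combining dot below (U+0323).
source = "\u1E0B\u0323"
nfd = unicodedata.normalize("NFD", source)
nfc = unicodedata.normalize("NFC", source)
print(codepoints(nfd))  # U+0064 U+0323 U+0307 (dot below sorts first)
print(codepoints(nfc))  # U+1E0D U+0307 (d-with-dot-below, then dot above)


# Point 2: not everything that decomposes is composed again.
def stays_decomposed(ch):
    """True if ch has a canonical decomposition but NFC does not restore it."""
    decomp = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a tag such as "<compat>".
    has_canonical = bool(decomp) and not decomp.startswith("<")
    return has_canonical and unicodedata.normalize("NFC", ch) != ch


# Scan the whole code space; this takes a moment but needs no data files.
not_recomposed = [
    chr(cp) for cp in range(sys.maxunicode + 1) if stays_decomposed(chr(cp))
]
print(len(not_recomposed), "\U0001D160" in not_recomposed)
```

The count varies with the Unicode version behind your Python build, so treat the list as a starting point and check CompositionExclusions.txt for the authoritative entries.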
more details: http://stackoverflow.com/questions/17897534/can-unicode-nfc-...
does anyone know the cause behind this?