
Just a small detail that isn't mentioned in the article:

in NFC form, "base characters and modifiers are combined into a single rune whenever possible"

the interesting detail is "whenever possible": since NFC works by first decomposing and then recomposing, there are some cases in which, even after NFC normalization, the characters remain decomposed

an example is 𝅘𝅥𝅮 (U+1D160, MUSICAL SYMBOL EIGHTH NOTE), whose normalized composed (NFC) form is made of 3 different codepoints
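A minimal Go sketch of this behaviour (not from the article; it assumes the golang.org/x/text/unicode/norm package is available):

    // NFC of U+1D160 stays decomposed into three codepoints,
    // because U+1D160 is in the composition exclusion set.
    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        s := "\U0001D160" // MUSICAL SYMBOL EIGHTH NOTE

        for _, r := range norm.NFC.String(s) {
            fmt.Printf("U+%04X ", r)
        }
        fmt.Println()
        // Expected: U+1D158 U+1D165 U+1D16E
    }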

I tried to look at the algorithm for generating the composition table, and it seems it's generated from the decomposition table... if that's so, I can't understand how it could happen that some code points have an NFC form longer than one codepoint

more details: http://stackoverflow.com/questions/17897534/can-unicode-nfc-...

does anyone know the cause behind this?



1. It's decompose, reorder, compose. So you can see some weird stuff like ḋ◌̣ (U+1E0B U+0323) → NFD = d◌̣◌̇ (U+0064 U+0323 U+0307) → NFC = ḍ◌̇ (U+1E0D U+0307)
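As a sketch of that decompose/reorder/compose pipeline (again assuming golang.org/x/text/unicode/norm):

    // ḋ (U+1E0B) followed by combining dot below (U+0323) is
    // decomposed, canonically reordered (dot below before dot above),
    // and recomposed into ḍ (U+1E0D) followed by dot above (U+0307).
    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func dump(label, s string) {
        fmt.Printf("%s:", label)
        for _, r := range s {
            fmt.Printf(" U+%04X", r)
        }
        fmt.Println()
    }

    func main() {
        s := "\u1E0B\u0323" // ḋ + combining dot below

        dump("input", s)                  // U+1E0B U+0323
        dump("NFD  ", norm.NFD.String(s)) // U+0064 U+0323 U+0307
        dump("NFC  ", norm.NFC.String(s)) // U+1E0D U+0307
    }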

2. It's not compression, it's normalisation. So it's not "compose everything you can". I cannot tell you the exact algorithm off the top of my head, but:

the reason for U+1D160 is that it's in the CompositionExclusions list.


Thanks, after looking up CompositionExclusions I discovered the rationale:

http://unicode.org/reports/tr15/#Primary_Exclusion_List_Tabl...

> When a character with a canonical decomposition is added to Unicode, it must be added to the composition exclusion table if there is at least one character in its decomposition that existed in a previous version of Unicode. If there are no such characters, then it is possible for it to be added or omitted from the composition exclusion table. The choice of whether to do so or not rests upon whether it is generally used in the precomposed form or not.



