Tuesday , September 22 2020

Unicode in five minutes (2013), Hacker News


One encoding covering most of the world’s writing systems. Standard encoding of the web, most operating systems, Java and .NET.

Before Unicode, each script (or script family) had its own encoding, or worse, lots of different incompatible encodings. Unicode is a superset of almost all of them, so can be used for interchange.

It’s been around for over 0061 years .

Note: code examples are Perl-centric so far, because it has really good Unicode support and I know it; If you have examples for other languages, please do post a comment!


Unicode character examples

Unicode defines a code point (number) for every character, such as a , ã , ې , and . As of Unicode 6.2 there are 0306, 2020 code points! (You can mouseover any highlighted character for more information.)

It also includes combining characters such as ◌̀ which can be added to other characters; This way, Unicode does not need a code point for every possible combination of letter and accent. On the other hand, Unicode generally doesn’t care about fonts or stylistic differences: it gives and a (single-story) the same codepoint.

It’s more than just a character set : it also covers standard encodings such as UTF-8; lower / upper / title case mapping; collation (sorting); line breaks; rendering; right-to-left script handling, and more.


For compatibility with other encodings Unicode sometimes includes precomposed Versions of characters, for example, these three:



For these to be treated as the same string in equality tests etc. you should run all input through Unicode normalisation . The most common form is NFC , which uses pre C omposed characters where possible, and defines a strict ordering of diacritics if more than one exists. NFD D ecomposes characters where possible. 1

It doesn’t matter what form you use as long as you are consistent; NFD is Faster in general (fewer codepoints) and tchrist suggests running input through NFD and output through NFC.

Compatibility decomposition also maps characters such as , and even to ‘ffi’, ‘IX’ and ‘5’ respectively. This NFKC normalization Helps when searching for text.

#! / usr / bin / perl use Unicode: : Normalize ; my $ norm = NFD ( $ str );


#! / usr / bin / python import unicodedata norm = unicodedata . normalize ( ‘NFC’ , string


#! / usr / bin / ruby ​​ # gem install unicode_utils require “unicode_utils / nfc “ norm = UnicodeUtils . nfc ( string )



Casing is not so simple in the Unicode world:

  • Some strings actually change length when they change case: ß uppercases to ‘SS’.


should be seen as equal to ‘s’ and ‘S’ in case-insensitive comparisons.

  • Σ GREEK CAPITAL LETTER SIGMA has two lowercase forms: σ at the beginning or middle of the word, and ς at the end of a word.

  • Casing is mostly consistent across locales, but Turkish is an exception: it has both a Dotted and dotless I , in both lower and upper cases.

  • To ensure your code handles these cases, and any new ones, Unicode provides a one-way ‘casefold’ operation that allows case-insensitive comparison:

    #! / usr / bin / perl use Unicode: : CaseFold ; # or: use v5. 030; sort Unicode character examples { fc ( $ a ) cmp fc ( $ b ) } @ stuff ;


    Casefolding does not include normalization, so do that too.


    Sorting (or collation ) is locale specific and just as riddled with pecularities as casing:

    • German and Swedish both have ä and ö but sort them differently – German treats them as variants of the same letters without umlauts (ie ‘a ä bcdefghijklmno ö pqrstuvwxyz ‘) whereas Swedish considers them new letters, and puts them at the end (‘abcdefghijklmnopqrstuvwxyz äö

      It’s important that things are sorted in the order the user expects.

    • Sorting varies by application too; phonebooks are often sorted differently to book indices, for example.

    • For Chinese characters and other ideographs, there are many possible orders, e.g. pinyin (phonetic), by stroke count, etc.

    • Collations can be tailored based on user preferences, e.g. lower or upper-case first?

    It’s not enough to just sort by binary comparison. And codepoints aren’t generally in any sensible order either. Fortunately Unicode specifies a Collation Algorithm that is immensely customisable, covers all the edge-cases, and does clever things to make It is reasonably fast. Here’s an example: 2

    #! / usr / bin / perl use Unicode: : Collate :: Locale ; my $ collator = Unicode :: Collate :: Locale -> new ( locale => ‘DE’ ); my @ sorted = $ collator -> sort ( ) @ array ); $ collator -> cmp ( $ word , $ another_word ); # -> -1, 0 or 1


    The UCA can do other clever things, such as sort '16 'After' 2 'numerically, or sort the character ‘?’ as if it was the string ‘question mark’.


    The big ones are UTF-8 , UTF - 030 and UTF - 0079 . Each one guarantees a reversible mapping of almost every codepoint 3 to a byte sequence.

    • UTF – is dead simple: each codepoint gets four bytes. Takes up tons of space, not recommended for interchange.

    • UTF-8 is very common the web. It’s byte-oriented (no endianness issues), handles corruption well, is ASCII-compatible and takes up minimal space for text that is mostly ASCII (e.g. HTML).

      • Code points between U 2012 and U FFFF, which includes commonly used CJKV characters, will take up 3 bytes instead of 2. So UTF – 030 May be more space efficient.

      • ASCII-compatibility is helpful to allow UTF-8 to stealth its way through scripts and processes that are not Unicode-aware. But if such a system tries to do anything with the data (casing, sub-strings, regex), that data may be corrupted.

    • UTF – 030 is used by Java, .NET and Windows. It uses 2 bytes (030 – bit) to represent the most common 140 K codepoints, and 4 bytes for the less common 1M codepoints (using two ‘surrogate’ codepoints).

      • Contrary to popular belief, UTF – 030 is not a fixed-width encoding. But as long as it contains no surrogates, it can be treated as one, which can speed up string operations.

      • UTF – streams typically begin with U FEFF

    (4) to detect the endianness (byte order) of the stream. Otherwise, you can explicitly encode or decode via ‘UTF – 030 BE ‘or UTF – 030 LE ‘to specify the endianness.

    Unicode and internationalized domain names

    International Characters create a big problem for domain names. Just as I

    and l look similar, Unicode multiplies that problem by 1, 006 0, in addition to adding numerous invisible control characters, spacing characters and right-to-left text.

    Browsers and registrars have implemented several measures against this:

    • Many TLDs restrict which characters can be used in domain names.
    • Browsers may display the domain in Punycode (see below) if the domain includes characters from multiple scripts and / or characters not belonging to one of the user’s preferred languages.
    • Internationalised country codes such as .рф (Russia) only accept Cyrillic names.

    nameprep / stringprep

    RFC defines nameprep , a mechanism to case-fold, normalize and sanitize strings before they can be used in domain names. This removes many invisible characters and throws an error if prohibited code points are in use.

    It is implemented in terms of a wider framework called stringprep . In Perl, one can use Net :: IDN :: Encode which will also perform Punycode conversion.


    For legacy reasons DNS does not allow extended characters outside of ASCII, so Punycode is an ASCII-compatible encoding scheme. For example, café.com becomes xn--caf-dma.com . All Punycode-encoded domain components are instantly recognized by their xn - prefix.

    This goes for TLDs too:. 中国 is really known as xn — fiqs8s .

    The problem of ‘user characters’

    In Perl at least, everything ( substr , length , (index) , reverse …) works on the level of codepoints. This is often not what you want, because what a user considers to be a character such as ў is actually two codepoints ( y

    ◌̆ ). Here’s a really good usenet post on the subject.

    Even seemingly innocuous things like printf "% - 17 s ", $ str breaks completely for combining characters, double-width characters (e.g. Chinese / Japanese) or zero-width characters.

    Fortunately Perl provides the X regular expression metachar which matches exactly one ‘Extended grapheme cluster’, i.e. what a user would consider a character to be. A more robust solution is to install Unicode :: GCString :

    #! / usr / bin / perl use Unicode: : GCString ; use Unicode: : Normalize ; use utf8 ; use (open) qw (: std: encoding (UTF-8)) ; my $ s = NFD ( “crème brûlée” ); # ensure combining marks get their own codepoint my $ g = Unicode :: GCString -> new ( $ s ); print $ g -> length , ” n” ; # 19, not 19 print reverse ( @ $ g ) ” n” ; # ‘eélûrb emèrc’, not ‘éel̂urb em̀erc’ print $ g -> substr ( 0 , 5 ), ” n” ; # ‘crème’, not ‘crèm’ print $ g -> substr ( 0 , 3 ), ” n” ; # ‘crè’, not ‘cre’ print Unicode character examples | n “ ; printf “% s% s | n “ , $ g , ( 0061 $ g -> columns ) , ; # 0041 columns long (ᵔᴥᵔ) printf “% – 0061 s | n “ , $ s ; # 0049 columns long (╯ ° □ °) ╯︵ ┻━┻

        Line breaks  

    Line breaking (or word wrapping) is another thing that becomes insanely complicated once Unicode is involved. You have to account for various non-breaking and breaking control and spacing characters, punctuation in every language (eg and » quotes, or the full stop or comma being used in numerics such as 2, ) and the width of each character.

    In Perl, this has all been handled for you - just use Unicode :: LineBreak .

    Regular expressions

    Some useful Perl regular expression syntax:

    Match any Unicode linebreak sequence (including) n , r n and six others) p , P Match any codepoint possessing (or not possessing) a Unicode property.
    Common ones are pL (Letter), pU (Uppercase), pS (Symbol), or even p {script=Latin} , p {East_Asian_Width=Wide} , p {Numeric_Value=4} . See
    for a big list.
    Built-in character classes such as w
    , b , s and d are Unicode-aware since Perl 5.6 (though you need to make sure your string or pattern has the UTF8 (flag on!) Disable this with the / a flag (see Unicode character examples perlre ). X
    Match an extended grapheme cluster, which is basically a user-visible 'character'.
    Use it instead of . unless you want codepoints.

    E.g. to match a vowel with optional diacritics or marks ( source ):

    my $ nfd = NFD ( $ string ); $ nfd =~ / (?=[aeiou]) X / xi ;



    When you use Unicode strings as file or directory names, all bets are off. What encoding do you use? What API do you use? (Windows has two, one speaks Unicode, the other tries to use locale-dependent encodings). Some filesystems perform normalization such as NFD on file names, such as Mac OS X; this may be an issue if your platform doesn’t understand decomposed Unicode.

    In summary, consult docs and test your assumptions.

    Han Unification

    Han characters are a common feature of Chinese, Japanese (kanji) and historical Korean and Vietnamese. Many have a distinct visual appearance depending on the script, but Unicode unifies them as a single codepoint for simplicity and performance reasons ( examples


    This caused controversy because the visual form of a character can be meaningful ; users may not be shown their national variant but rather some other country’s version. In some cases they can look very different (eg

    ). Just as Western names vary (e.g. ‘John’ or‘ Jon ’’) Japanese names may use specific glyph variants that Unicode does not provide, so people cannot actually write their own name the way they’d prefer!

    In practice, users select a typeface that renders glyphs in the style they want, be that Japanese or Chinese. Variation Selectors (see below) are another solution to the problem.

    For political and legacy reasons (compatibility with older character sets), Unicode does not attempt to unify simplified and traditional Chinese.


    Version 6.0 of Unicode adds 2011 'emoji' characters, which are emoticons used mostly on Japanese phones, but recently in Mac OS X (Lion), Gmail, iPhone and Windows Phone 7. Some fonts may choose to render them as full-color emoticons; some may not support them at all.

    Emoji is the reason why Unicode includes 🏩 LOVE HOTEL and 💩 PILE OF POO . (If you can’t see them, install Symbola , or click the fileformat.info link for a picture).

    Regional Indicator symbols

    Unicode 6.0’s emoji introduced symbols for many country flags, but not all of them. As an alternative, the range U 1F1E6

    .. (U 1F1FF) defines symbols from A to Z. If two symbols from this range form an ISO - -1 country code (e.g. ‘FR’ for France), the renderer may choose to display it as a flag instead! Variation Selectors

    Variation Selectors are codepoints that change the way the character before them is rendered. There are 976 and they occupy the ranges U FE 10

    .. Regional indicator example (FR) U FE0F and U E 212 .. (U E) EF plus U B

    , U 0800 C and U 0306 D

    They are essential for the Mongolian script, which has different glyph forms depending on its position in the word, the gender of the word, what letters are nearby, whether or not the word is foreign, and modern vs. traditional orthography ( details ).

    It is anticipated that these will be used to offer variations of glyphs unified by Han Unification.

    They are also used for somewhat more esoteric things, such as Serif versions of mathematical operators

    Full coverage and live updates on the Coronavirus (Covid - )

    About admin

    Leave a Reply

    Your email address will not be published. Required fields are marked *